Volume RQC-9, APRIL, 1960
Follows PGRQC-16 
TABLE OF CONTENTS

Calculation of Average Failure-to-Failure Time of Equipment .... M.A. Sinitsa 1
On Reservation by Replacement Method .... M.A. Sinitsa 6
Some Results of Mathematical Reliability Theory .... B.R. Levin 14
[Title illegible] .... E.J. Breiding 19
Diagnosis of Equipment Failures .... J.D. Brulé, R.A. Johnson, and E.J. Kletsky 23
An Information Theory Approach to Diagnosis (Abstract) .... R.A. Johnson 35
Reliability of Parallel Electronic Components .... H. Walter Price 35
Evaluation and Prediction of Circuit Performance by a [illegible] .... J. Marini and R. Williams 40
Reliability Using Redundancy Concepts .... L. Depian and N.T. Grisamore 53
How Can We Attain High Reliability of Complex Military [illegible] .... Morris Halio 61
[Title illegible] .... Dana A. Griffin 70
Criteria for Determining Optimum Redundancy .... R.E. Barlow and L.C. Hunter 73
Interval Estimation of Product Reliability by Use of the [illegible] .... R.E. Schafer 77
Is Anything New in Reliability? .... W.D. McGuigan 81
Contractual Aspects of Reliability .... R.W. Smiley 84
Reliability Predictions, A Case History .... R.A. David and W. Wahrhaftig 87
Contractor Management Looks at Reliability Program Activities .... W.B. LaBerge 90
Design Information Interchange Among Co-Contractors .... Martin Barbe 93
Human Factors in the Attainment of Reliability .... R.S. Lincoln 97
Estimation from Life Test Data .... Benjamin Epstein 104
A Customer Looks at the Reliability Program Activities .... H.R. Powell 108
Correction to "Module Prediction" .... George Hauser 112
Announcement .... Back Cover




IRE PROFESSIONAL GROUP ON RELIABILITY AND QUALITY CONTROL


The Professional Group on Reliability and Quality Control is an organization, within the framework of the IRE, of members with principal professional interest in Reliability and Quality Control. All members of the IRE are eligible for membership in the Group and will receive all Group publications.


Annual Fee: $2.00 


Administrative Committee

P. K. McElroy, Chairman

E. J. Breiding    L. J. Paddison
J. C. McAdam    C. M. Ryerson

R. F. Rollman, Secretary
H. J. Stryker, Treasurer

[illegible] Greer    W. X. Lamb, Jr.    [illegible]
A. Hill    P. K. McElroy    R. L. Vander Hamm
[illegible] Jacobson    L. L. Schneider    Victor Wouk
J. R. Somerville




IRE TRANSACTIONS
on Reliability and Quality Control 


Published by the Institute of Radio Engineers, Inc., for the Professional 
Group on Quality Control, 1 East 79th Street, New York 21, New York. 
Responsibility for the contents rests upon the authors, and not upon the IRE, 
the Group or its members. Individual copies available for sale to IRE- 
PGRQC members at $2.35, to IRE members at $3.50 and to nonmembers 
at $7.05. 


Copyright © 1960—THE INSTITUTE OF RADIO ENGINEERS, INC. 
All rights, including translation, are reserved by the IRE. Requests for republica- 
tion privileges should be addressed to the Institute of Radio Engineers, 1 East 79th 
Street, New York 21, N. Y. 


PRINTED IN U.S.A. 




CALCULATION OF AVERAGE FAILURE-TO-FAILURE 
TIME OF EQUIPMENT * 


M. A. SINITSA†


INTRODUCTION 


The average failure-to-failure time is the most convenient criterion of the reliability of radio and electronic equipment. This value is known to vary considerably during operation, especially in airborne equipment. At present, reliability is widely calculated on the basis of the exponential law, whose use presupposes that the average failure-to-failure time, or, equivalently, the average frequency of failures, remains constant during operation. It is therefore of interest to find the law of change of the average failure-to-failure time over the whole period of operation of the equipment, expressed through the probability distribution laws of the failure-free operating time of its components.


LAWS OF PROBABILITY DISTRIBUTION OF THE OPERATION TIME OF EQUIPMENT UNTIL THE FIRST FAILURE


The probability of damage to equipment not provided with reserve devices may be expressed through the probabilities of damage to its components, as shown in Siforov's paper [1]:

$$G = 1 - \prod_{i=1}^{n} (1 - F_i),$$

where

F_i = the probability of damage to the ith component,

G = the probability of damage to the equipment,

n = the total number of components in the equipment.

It is assumed here that the values F_i are statistically independent. Regarding the F_i as the distribution functions of the failure-free operating time of the components, it is possible to determine the distribution function of the time of occurrence of the first failure of the equipment:

$$G(t) = 1 - \prod_{i=1}^{n} \left[1 - F_i(t)\right]. \qquad (1)$$

*This paper was presented at the Fifth Natl. Symp. on Reliability and Quality Control, Philadelphia, Pa., January 12-14, 1959.

†USSR.


If all the components of the equipment have the same probability distribution law, then

$$G(t) = 1 - \left[1 - F(t)\right]^{n}. \qquad (2)$$


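The probability density of the time of occurrence of the first failure of the equipment is obtained by differentiating (1) with respect to t:

$$g(t) = \sum_{i=1}^{n} f_i(t) - \sum_{i=1}^{n} \Big\{ f_i(t) \sum_{\substack{j=1\\ j\ne i}}^{n} F_j(t) \Big\} + \sum_{i=1}^{n} \Big\{ f_i(t) \sum_{\substack{j<l\\ j,l\ne i}} F_j(t)\,F_l(t) \Big\} - \cdots + (-1)^{n-1} \sum_{i=1}^{n} \Big\{ f_i(t) \prod_{\substack{j=1\\ j\ne i}}^{n} F_j(t) \Big\}, \qquad (3)$$

where g(t) = dG/dt and f_i(t) = dF_i/dt.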
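A quick numerical check of (3) and of the count of 32 terms for n = 4 can be sketched in Python. This is a minimal sketch assuming arbitrary illustrative values of f_i(t) and F_i(t) at a single instant, not data from the paper. It evaluates g(t) once from the product form obtained by differentiating (1) and once term by term from the expansion; each of the n outer sums expands into 2^(n-1) terms, n·2^(n-1) in total.

```python
import itertools
import math

def density_product_form(f, F):
    """g(t) = sum_i f_i(t) * prod_{j != i} [1 - F_j(t)]: the derivative of (1)."""
    n = len(f)
    return sum(f[i] * math.prod(1.0 - F[j] for j in range(n) if j != i)
               for i in range(n))

def density_expanded_form(f, F):
    """Term-by-term evaluation of (3): for each i, alternating sums of f_i(t)
    times products of F_j(t) over subsets of the other components.
    Returns (value, number_of_terms)."""
    n = len(f)
    value, terms = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for subset in itertools.combinations(others, size):
                value += (-1) ** size * f[i] * math.prod(F[j] for j in subset)
                terms += 1
    return value, terms

# Illustrative (hypothetical) values of f_i(t) and F_i(t) at one instant t.
f = [0.002, 0.001, 0.003, 0.0015]
F = [0.10, 0.05, 0.20, 0.08]
g_expanded, n_terms = density_expanded_form(f, F)
print(density_product_form(f, F), g_expanded, n_terms)
```

The term count grows as n·2^(n-1), which is why the text looks for simpler approximations.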


These expressions make it possible, in principle, to find the probability distribution laws of the operating time of the equipment until the first failure from the known distribution laws of the failure-free operating time of the components. However, such formulas are extremely difficult to use in practice, because even with n = 4, (3) contains 32 terms. True, when F(t) ≪ 1 it is possible to restrict (1) and (3) to the first sums only, which contain F(t) and f(t) to the first power; this is what was done in Siforov's paper [1]. However, such a simplification is not always acceptable, since the condition that actually has to be fulfilled is nF(t) ≪ 1, not merely F(t) ≪ 1. With a large number of components and insufficiently low values of F(t), the former condition may be violated. The problem therefore arises of approximating (1)


2 IRE TRANSACTIONS ON RELIABILITY AND CONTROL 


and (3) by functions that are more convenient for computation and ensure sufficient accuracy in engineering calculations. With this aim in view, it is expedient to allow certain simplifications in posing the problem. First of all, the n components are grouped into a small number m of subgroups, on the presumption that the components of each subgroup have the same distribution law. Then (1) may be written as

$$G(t) = 1 - \left[1 - F_1(t)\right]^{n_1}\left[1 - F_2(t)\right]^{n_2}\cdots\left[1 - F_m(t)\right]^{n_m}, \qquad (4)$$

where n_i is the number of components in the ith subgroup.


Most of the widely used components of radio equipment have F(t) values not exceeding 0.01 to 0.02 over an operating time of 1000 hours [3]. Only for such components as magnetrons, klystrons, and certain types of UHF electron-vacuum devices do the functions F(t) noticeably exceed these values. However, since the number of such components in equipment is usually small, all the binomials in (4) can be replaced by an approximate expression of the type

$$(1 - F)^{n} \approx e^{-nF}.$$

Then (4) may be written as

$$G(t) = 1 - \exp\Big\{-\sum_{i=1}^{m} n_i F_i(t)\Big\}. \qquad (5)$$
i=1 


Differentiating with respect to t, we obtain the probability density of the time of occurrence of the first failure of the equipment:

$$g(t) = \sum_{i=1}^{m} n_i f_i(t)\, \exp\Big\{-\sum_{j=1}^{m} n_j F_j(t)\Big\}. \qquad (6)$$


The relative error due to replacing a binomial of the type (1 - F)^n by the exponential e^{-nF} can be evaluated from the expression

$$\delta = \frac{e^{-nF} - (1 - F)^{n}}{(1 - F)^{n}}.$$

Curves of this relative error, computed for various numbers n of components in a group, are shown in Fig. 1. The graphs pertain only to (5) and cannot be applied unconditionally to (6), since differentiating an approximated function can in itself produce considerable errors.

We shall now evaluate the relative error that arises when (6) is applied to one group of components. For this we first determine the exact value of the function g(t):

$$g(t) = n f(t)\left[1 - F(t)\right]^{n-1}, \qquad (7)$$


APRIL 


Fig. 1—Relative error in exponential approximation to 
power of a binomial. 


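and the approximate value,

$$g^{*}(t) = n f(t)\, e^{-nF(t)},$$

and then, from the formula

$$\delta = \frac{g^{*}(t) - g(t)}{g(t)},$$

we find

$$\delta = \frac{e^{-nF(t)}}{\left[1 - F(t)\right]^{n-1}} - 1. \qquad (8)$$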


Reducing the relative error of (5) to a similar form,

$$\delta = \frac{e^{-nF(t)}}{\left[1 - F(t)\right]^{n}} - 1,$$

one can see that the relative errors of (5) and (6) are of the same order. The graphs in Fig. 2, plotted from (8), clearly confirm this.
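The two relative errors can be tabulated directly. The sketch below assumes an arbitrary grid of n and F values. For small F the error of (5) is approximately nF²/2, and the error (8) of (6) equals (1 + error of (5))(1 - F) - 1, so the two remain of the same order while nF is small. Note that f(t) cancels in (8), so only n and F(t) enter.

```python
import math

def error_of_5(n, F):
    """Relative error of replacing (1 - F)^n by e^{-nF}."""
    return math.exp(-n * F) / (1.0 - F) ** n - 1.0

def error_of_6(n, F):
    """Relative error (8) of the approximate density n f e^{-nF};
    f(t) cancels, so only n and F(t) matter."""
    return math.exp(-n * F) / (1.0 - F) ** (n - 1) - 1.0

for n in (10, 100, 1000):
    for F in (0.0001, 0.001, 0.01):
        print(f"n={n:5d}  F={F:<7}  (5): {error_of_5(n, F):+.2e}  (6): {error_of_6(n, F):+.2e}")
```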

Thus, our approximation of binomials by exponential expressions is quite acceptable as regards the accuracy of the result and is sufficiently convenient for practical calculations.

Eqs. (5) and (6) could also have been derived directly from Poisson's law,

$$G(k) = \frac{\beta^{k}}{k!}\, e^{-\beta},$$

where G(k) is the probability of damage to k components out of a total of n. The parameter β equals nF, where F is the probability of damage to a component. As applied to the conditions of our problem, Poisson's law can be written as

$$G(k) = \frac{(nF)^{k}}{k!}\, e^{-nF}.$$




Fig. 2—Relative error in probability formula resulting 
from use of exponential approximation. 


Poisson's formula is usually applied when F = const., i.e., under constant experimental conditions, so that the probability of damage to k components out of n is a function of n and k only. However, we are justified in considering several series of experiments. The value of F, while remaining constant within one series, can change according to a certain law from one series to another. Each series of experiments can be made to correspond to a certain value of time t, since F = F(t). Hence G is a function not only of n and k but also of t: G = G(n, k, t). Taking k and n to be constant for all series of experiments, we obtain G as a function of time alone:

$$G_k(t) = \frac{\left[nF(t)\right]^{k}}{k!}\, e^{-nF(t)}.$$

In the present case we are concerned with the probability of failure of the system in time t, i.e., the probability that at least one of the components will be damaged in time t:

$$P(t) = \sum_{k=1}^{n} G_k(t).$$

This value is calculated more simply through the probability of the opposite event:

$$P(t) = 1 - Q(t), \qquad Q(t) = G_0(t) = e^{-nF(t)},$$


SINITSA: FAILURE-TO-FAILURE TIME OF EQUIPMENT 


i.e., we have obtained an expression for the probability of failure-free operation of a group of components having the distribution function F(t). The reliability of the equipment is determined by the product of these values,

$$Q_a(t) = \exp\Big\{-\sum_{i=1}^{m} n_i F_i(t)\Big\},$$

from which the distribution function of the operating time of the equipment until the first failure, as given by (5), may be obtained:

$$G_a(t) = 1 - \exp\Big\{-\sum_{i=1}^{m} n_i F_i(t)\Big\}.$$
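The grouped approximation (5) and the Poisson route to it can be compared numerically. The sketch below assumes a hypothetical equipment with three subgroups; the values of n_i and F_i(t) at one instant are invented for illustration. It evaluates the exact product (4), the exponential form (5), and the Poisson sum P(t) = Σ G_k(t) for one group, which should reproduce 1 - e^{-nF}.

```python
import math

# Hypothetical equipment at one instant t: (n_i, F_i(t)) for each subgroup.
groups = [(2000, 0.0002), (500, 0.001), (20, 0.02)]

def G_exact(groups):
    """(4): 1 - prod_i [1 - F_i(t)]^{n_i}."""
    return 1.0 - math.prod((1.0 - F) ** n for n, F in groups)

def G_exponential(groups):
    """(5), equivalently 1 - Q_a(t): 1 - exp{-sum_i n_i F_i(t)}."""
    return 1.0 - math.exp(-sum(n * F for n, F in groups))

def poisson_at_least_one(n, F):
    """P(t) = sum_{k=1}^{n} G_k(t) with G_k(t) = (nF)^k / k! * e^{-nF}."""
    beta = n * F
    term = math.exp(-beta)        # G_0(t) = Q(t)
    total = 0.0
    for k in range(1, n + 1):
        term *= beta / k          # G_k obtained from G_{k-1}
        total += term
    return total

print(G_exact(groups), G_exponential(groups))
print(poisson_at_least_one(500, 0.001), 1.0 - math.exp(-0.5))
```

Here Σ n_i F_i = 1.3, so nF ≪ 1 does not hold, yet the exponential form still stays within a few tenths of a percent of the exact value.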


The use of Poisson's asymptotic formula in this case is subject to the same conditions accepted above, namely, a sufficiently low value of F(t) and a high value of n. Consequently, the error incurred when using Poisson's formula can also be evaluated by the method suggested by B. V. Gnedenko [2].

On the basis of the distribution laws obtained, it is possible to find the average operating time until the first failure or, equivalently, the mathematical expectation of the failure-free operating time:

$$t_1 = \int_0^{\infty} t\, g(t)\, dt.$$

Replacing g(t) in this formula by (6), we obtain

$$t_1 = \sum_{i=1}^{m} n_i \int_0^{\infty} t\, f_i(t)\, \exp\Big\{-\sum_{j=1}^{m} n_j F_j(t)\Big\}\, dt. \qquad (9)$$


The dispersion (variance) of the operating time until the first failure can be determined from

$$D[t_1] = \sum_{i=1}^{m} n_i \int_0^{\infty} (t - t_1)^{2}\, f_i(t)\, \exp\Big\{-\sum_{j=1}^{m} n_j F_j(t)\Big\}\, dt. \qquad (10)$$
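Equations (9) and (10) lend themselves to straightforward numerical quadrature. The following is a minimal sketch assuming exponential component laws with a hypothetical rate of 1e-5 per hour and 1000 identical components, with the infinite upper limit truncated at t_max. For this case t_1 should come out close to 1/(nλ) = 100 hours, and the standard deviation should be of about the same size.

```python
import math

def first_failure_moments(groups, t_max, steps=20000):
    """Trapezoid-rule evaluation of (9) and (10).

    groups: list of (n_i, f_i, F_i), f_i and F_i being callables of t.
    The upper limit infinity is truncated at t_max, assumed large enough
    for the integrand to be negligible beyond it."""
    h = t_max / steps
    ts = [k * h for k in range(steps + 1)]
    # g(t) from (6): sum_i n_i f_i(t) * exp{-sum_j n_j F_j(t)}
    g = [sum(n * f(t) for n, f, _ in groups)
         * math.exp(-sum(n * F(t) for n, _, F in groups)) for t in ts]
    w = [0.5 if k in (0, steps) else 1.0 for k in range(steps + 1)]
    t1 = h * sum(wk * t * gk for wk, t, gk in zip(w, ts, g))
    disp = h * sum(wk * (t - t1) ** 2 * gk for wk, t, gk in zip(w, ts, g))
    return t1, disp

lam = 1e-5   # hypothetical failure rate per hour of each component
component = (1000, lambda t: lam * math.exp(-lam * t), lambda t: 1.0 - math.exp(-lam * t))
t1, disp = first_failure_moments([component], t_max=2000.0)
print(f"t1 = {t1:.2f} h, sqrt(D) = {math.sqrt(disp):.2f} h")
```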


Thus, given the probability distribution laws of component damage, f(t) or F(t), it is possible to find the probability distribution laws of the operating time of the equipment until the first failure [g(t) and G(t)] and the basic numerical characteristics of these laws: the mathematical expectation and the dispersion.




THE LAW OF CHANGE OF THE AVERAGE FAILURE-TO-FAILURE TIME


The average failure-to-failure time of equipment can be determined in several ways. A strict theoretical determination of this value requires evaluating the probability that one of the components of a group has been replaced after t_1 hours of operation of the equipment. Next, it is necessary to evaluate the conditional probability distribution laws of the time of damage of the equipment under the condition that one or another of the components has been replaced. On the basis of these laws, a series of conditional values of the average failure-to-failure time t_2 can be determined. Finally, from the conditional values of t_2, with allowance for the proportion of each group of components in the equipment, it is possible to find the mathematical expectation of the operating time between the first and second failures. This method of finding t_2 is extremely laborious. Since the components of modern radio and electronic equipment number in the thousands, and sometimes in the tens and hundreds of thousands, the effect of replacing one or several components on the overall probability distribution law can be neglected. The approximate calculation of the average operating time between the first and second failures then reduces to the following: the instant t_1 is taken as the point of departure from which we determine the mathematical expectation t_2 of the operating time until the second failure. When finding the distribution laws for the equipment in this case, we use not the function f(t) but the normalized probability densities, calculated on the supposition that at the instant t_1 all the components were intact:


$$\varphi(\theta) = \frac{f(t_1 + \theta)}{1 - \int_0^{t_1} f(t)\, dt}, \qquad \theta \ge 0. \qquad (11)$$


In Fig. 3 this conditional law of distribution is 
shown by the broken line. 

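On the basis of the function f(t) known for each group of components, it is possible to find the conditional probability density of failure of the equipment by using (6):

$$w(\theta) = \sum_{i=1}^{m} n_i \varphi_i(\theta)\, \exp\Big\{-\sum_{j=1}^{m} n_j \int_0^{\theta} \varphi_j(u)\, du\Big\}. \qquad (12)$$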
i=l 


Next, it is not hard to find the mathematical expectation of the failure-free operating time of the equipment under the condition that at the instant t_1 the equipment was intact:

$$t_2 = \sum_{i=1}^{m} n_i \int_0^{\infty} \theta\, \varphi_i(\theta)\, \exp\Big\{-\sum_{j=1}^{m} n_j \int_0^{\theta} \varphi_j(u)\, du\Big\}\, d\theta. \qquad (13)$$

Fig. 3—Change in distribution laws.


When determining the average operating time between the second and third failures, t_3, it is similarly possible to normalize the function f(t) with respect to the instant t_1 + t_2,

$$p(\xi) = \frac{f(t_1 + t_2 + \xi)}{1 - \int_0^{t_1 + t_2} f(t)\, dt},$$

and to find the new conditional probability distribution law, v(ξ).

Having obtained in this way a series of values of t_k, it is possible to plot the law of change of the average failure-to-failure time as a function of the operating time of the equipment.
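The step-by-step procedure (normalize at t_1, compute t_2 from (13), normalize at t_1 + t_2, and so on) can be sketched as follows. The Weibull component laws and their parameters are assumptions made only for illustration, and, like the text, the sketch neglects the effect of replacing failed components. With an increasing component hazard (shape 2) the successive values t_k decrease. For exponential components they stay constant, as the memoryless property of the exponential law would suggest.

```python
import math

def weibull(eta, beta):
    """Hypothetical component law: Weibull density f(t) and distribution F(t)."""
    def F(t):
        return 1.0 - math.exp(-((t / eta) ** beta))
    def f(t):
        return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))
    return f, F

def conditional_mean(groups, T, theta_max, steps=20000):
    """Mean time to the next failure counted from instant T, all components
    assumed intact at T: each f_i is normalized at T as in (11), combined as
    in (12), and the mean (13) is evaluated with the trapezoid rule, the
    infinite limit being truncated at theta_max."""
    h = theta_max / steps
    total = 0.0
    for k in range(steps + 1):
        theta = k * h
        rate, exponent = 0.0, 0.0
        for n, f, F in groups:
            surv = 1.0 - F(T)                               # 1 - integral_0^T f(t) dt
            rate += n * f(T + theta) / surv                 # n_i * phi_i(theta)
            exponent += n * (F(T + theta) - F(T)) / surv    # n_i * integral_0^theta phi_i
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * theta * rate * math.exp(-exponent)
    return total * h

def failure_to_failure_times(groups, count, theta_max):
    """t_1, t_2, ...: t_k is counted from the instant t_1 + ... + t_(k-1)."""
    times, T = [], 0.0
    for _ in range(count):
        times.append(conditional_mean(groups, T, theta_max))
        T += times[-1]
    return times

wearing = [(1000, *weibull(eta=50000.0, beta=2.0))]   # increasing hazard
ageless = [(1000, *weibull(eta=1.0e5, beta=1.0))]     # exponential law
wearing_times = failure_to_failure_times(wearing, 3, theta_max=10000.0)
ageless_times = failure_to_failure_times(ageless, 3, theta_max=2000.0)
print([round(t, 1) for t in wearing_times])
print([round(t, 2) for t in ageless_times])
```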

The instant t_k need not necessarily be associated with the instant of occurrence of the kth failure; it can obviously be chosen arbitrarily on the time axis. The distribution density w(ξ, t_k) must then be regarded as a function of two arguments, ξ and t_k, where t_k is the instant with respect to which the distribution laws of component failure f(t) are normalized, and ξ is the current time measured from the point of departure t_k. The mathematical expectation of the operating time between two consecutive failures can in this case be expressed by

$$\bar{t}(t_k) = \frac{1}{Q(t_k)} \int_0^{\infty} \xi\, g(t_k + \xi)\, d\xi. \qquad (14)$$

Here

$$Q(t_k) = 1 - \int_0^{t_k} g(t)\, dt. \qquad (15)$$


Thus, it seems possible to obtain not merely a series of discrete values of t_k but a law of change of the average failure-to-failure time as a function of the operating time of the equipment.
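Relations (14) and (15) give the mean time to the next failure as a continuous function of the instant t_k. The sketch below assumes two hypothetical equipment-level densities g(t): an exponential law, for which the result should be 1/rate for every t_k, and a Weibull law with shape 2, for which it should start from ηΓ(1.5) and decrease with t_k. The infinite upper limit is truncated at xi_max, which is also an assumption of the sketch.

```python
import math

def trapezoid(func, a, b, n=20000):
    """Composite trapezoid rule on [a, b] with n panels."""
    h = (b - a) / n
    s = 0.5 * (func(a) + func(b)) + sum(func(a + k * h) for k in range(1, n))
    return s * h

def mean_to_next_failure(g, t_k, xi_max):
    """(14)-(15): mean operating time from instant t_k to the next failure,
    given that the equipment was in order at t_k. The infinite upper limit
    of (14) is truncated at xi_max."""
    Q = 1.0 - trapezoid(g, 0.0, t_k)                        # (15)
    return trapezoid(lambda xi: xi * g(t_k + xi), 0.0, xi_max) / Q

# Hypothetical equipment-level densities g(t).
rate = 0.01
def g_exponential(t):
    return rate * math.exp(-rate * t)

eta, beta = 1000.0, 2.0
def g_weibull(t):
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))

exp_means = [mean_to_next_failure(g_exponential, tk, 3000.0) for tk in (0.0, 100.0, 300.0)]
wb_means = [mean_to_next_failure(g_weibull, tk, 5000.0) for tk in (0.0, 500.0, 1000.0)]
print([round(x, 2) for x in exp_means])
print([round(x, 1) for x in wb_means])
```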

It should be mentioned that the probability density of the failure-free operating time of the equipment, w(ξ, t_k), can be obtained not only by normalizing the functions f(t) but also by normalizing the initial distribution density g(t). Indeed, the distribution function of the ith component, counted from any instant t_k at which the component was in order, is

$$F_{ik}(\xi) = \frac{F_i(t_k + \xi) - F_i(t_k)}{1 - F_i(t_k)}.$$

Using F_{ik}(ξ), the conditional probability of failure-free operation of equipment consisting of m groups of components can be expressed as

$$Q_k(\xi, t_k) = \prod_{i=1}^{m}\left[1 - F_{ik}(\xi)\right]^{n_i} = \prod_{i=1}^{m}\left[\frac{1 - F_i(t_k + \xi)}{1 - F_i(t_k)}\right]^{n_i} = \frac{Q(t_k + \xi)}{Q(t_k)},$$

where Q(t) is the reliability of the equipment. Taking into account that

$$w(\xi, t_k) = -\frac{dQ_k(\xi)}{d\xi},$$

we obtain

$$w(\xi, t_k) = \frac{g(t_k + \xi)}{Q(t_k)}.$$

It can be shown that this conclusion also holds when the exponential approximation is employed. In that case

$$Q_k(\xi) = \exp\Big\{-\sum_{i=1}^{m} n_i\left[F_i(t_k + \xi) - F_i(t_k)\right]\Big\} = \exp\Big\{-\sum_{i=1}^{m} n_i \int_{t_k}^{t_k + \xi} f_i(t)\, dt\Big\}.$$


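The conditional probability density of failure of the equipment is now calculated on the assumption that at the instant t_k the equipment was in order:

$$w(\xi, t_k) = -\frac{dQ_k(\xi)}{d\xi} = \sum_{i=1}^{m} n_i f_i(t_k + \xi)\, \exp\Big\{-\sum_{j=1}^{m} n_j \int_0^{t_k + \xi} f_j(t)\, dt + \sum_{j=1}^{m} n_j \int_0^{t_k} f_j(t)\, dt\Big\}. \qquad (16)$$

Since t_k + ξ = t,

$$w(\xi, t_k) = \frac{\sum_{i=1}^{m} n_i f_i(t)\, \exp\Big\{-\sum_{j=1}^{m} n_j \int_0^{t} f_j(u)\, du\Big\}}{\exp\Big\{-\sum_{j=1}^{m} n_j \int_0^{t_k} f_j(u)\, du\Big\}} \qquad (17)$$

or

$$w(\xi, t_k) = \frac{g(t)}{1 - \int_0^{t_k} g(t)\, dt}, \qquad t_k < t < \infty, \qquad (18)$$

which was to be shown.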
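The equivalence of (16) and (18) can also be checked numerically. The sketch below assumes a hypothetical equipment with two groups (exponential and Weibull components, parameters invented). It computes w(ξ, t_k) from the component laws as in (16), and from the equipment density g(t) of (6) normalized as in (18), the latter with a trapezoid-rule integral. The two should agree to within the quadrature error.

```python
import math

# Hypothetical equipment: (n_i, f_i, F_i) for two groups of components.
lam, eta = 2e-6, 40000.0
groups = [
    (3000, lambda t: lam * math.exp(-lam * t),
           lambda t: 1.0 - math.exp(-lam * t)),
    (200, lambda t: 2.0 * t / eta ** 2 * math.exp(-((t / eta) ** 2)),
          lambda t: 1.0 - math.exp(-((t / eta) ** 2))),
]

def g(t):
    """Equipment failure density (6) under the exponential approximation."""
    return (sum(n * f(t) for n, f, _ in groups)
            * math.exp(-sum(n * F(t) for n, _, F in groups)))

def w_from_components(xi, t_k):
    """(16): component laws counted from t_k, all components in order at t_k."""
    rate = sum(n * f(t_k + xi) for n, f, _ in groups)
    exponent = sum(n * (F(t_k + xi) - F(t_k)) for n, _, F in groups)
    return rate * math.exp(-exponent)

def w_from_equipment(xi, t_k, steps=20000):
    """(18): g(t_k + xi) divided by 1 - integral_0^{t_k} g(t) dt (trapezoid rule)."""
    h = t_k / steps
    integral = h * (0.5 * (g(0.0) + g(t_k)) + sum(g(k * h) for k in range(1, steps)))
    return g(t_k + xi) / (1.0 - integral)

for xi in (0.0, 50.0, 200.0):
    print(xi, w_from_components(xi, 200.0), w_from_equipment(xi, 200.0))
```

Normalizing g(t) requires one integral of a single function, whereas (16) needs every group's law at every step, which is consistent with the author's preference for normalizing the equipment law.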

Therefore, in order to determine the law of 
change of the average failure-to-failure time it is 
first necessary to find the conditional distribution 
densities for the components or for the equipment. 
We are of the opinion that normalization of the 
laws of distribution for equipment is the more 
convenient operation. 


CONCLUSION 


The report produced a number of asymptotic 
expressions for the laws of probability distribu- 
tion of the operation time of equipment until the 
first failure. An evaluation was made of the 
errors incurred when crossing over from exact to 
approximate expressions and it was shown that, 
when applied to modern equipment, the errors of 
approximated formulas did not exceed several per 
cent. 

The report dealt with methods of approximate calculation of the law of change of the average failure-to-failure operating time of equipment from any arbitrarily chosen instant t_k. It was shown that normalizing the distribution laws of the components can be replaced by the equivalent, but less laborious, operation of normalizing the initially obtained probability density of failure of the equipment.


REFERENCES 


[1] V. I. Siforov, "On the methods of calculating the reliability of operation of systems containing a large number of components," Izv. Akad. Sci. USSR, no. 6; 1954.

[2] B. V. Gnedenko, "Course of the Theory of Probability," GITTL, Moscow; 1954.

[3] I. I. Morozov, "Reliability of operation of radioelectric equipment," Radioelectronnaya Promishlyennost, no. 3; 1958.




ON RESERVATION BY REPLACEMENT METHOD* 


M. A. SINITSA†


Summary: Consideration of substitution techniques for reliable design is usually based upon the statistical independence of failures of operating and spare components.

A method of designing a reliable system consisting of many operating and spare components is considered for the case in which failures of operating and spare components are statistically related. In a number of cases more reliable operation of a system can be achieved by substitution reservation instead of a permanently connected spare ("hot" reserve) component. Effective use of a single spare component substituted for several operating components is considered.


The reservation of radioelectronic equipment by the replacement method, owing to a number of operational and economic advantages, finds ever-increasing application in engineering. The reliability of systems with such reservation cannot be evaluated by the methods described by Moskowitz,¹ Levin,² and Welber, et al.,³ since they are all based on the assumption that failures of basic and reserve components are statistically independent. When reserving by the replacement method, each reserve component is put into service after failure of the basic one. This means that the probability of failure of the reserve component depends upon the probability of failure of the basic component. This report attempts to develop a method of evaluating the failure probability of complex systems consisting of many operating and reserve components, on the condition that failure of the reserve components is statistically dependent on failure of the basic components or of reserve ones previously placed in service.

Let us assume that the system consists of k identical operating components and m similar reserve ones. Each reserve component can replace any failed operating component or any reserve one previously placed in service. The effect of switching devices on reliability will not be taken into account here, so as not to complicate the problem at this first stage. To determine the probability of failure of such a system, we shall use the method described elsewhere by the author.⁴

*This paper was presented at the Fifth Natl. Symp. on Reliability and Quality Control, Philadelphia, Pa., January 12-14, 1959.

†USSR.

¹ F. Moskowitz and I. B. McLean, "Some reliability aspects of systems design," IRE TRANS. ON RELIABILITY AND QUALITY CONTROL, no. PGRQC-8, pp. 7-35; September, 1956.

² B. R. Levin, "Increase of system reliability by reservation," Elektrosviaz, no. 11; 1956.

³ I. Welber, H. W. Evans, and G. A. Pullis, Bell Sys. Tech. J., vol. 34, pp. 473-510; 1955.

First, let us consider the solution of a simple problem: there are k operating components and one reserve component that can replace any of them. In this case failure of the system during time T may result from either of two mutually exclusive events:

1) Failure of not less than one operating component and of the reserve one.

2) Failure of not less than two operating components, with the reserve component operating properly.


Let us divide the longest possible failure-free operating time of the components into w intervals of duration Δτ, and take as hypotheses:

a) For the first event, H_1^i: failure in the basic system of not less than one operating component during interval i.

b) For the second event, H_2^j: failure of not less than two operating components during interval j.


The probability of system failure during time T can be determined with the aid of the total probability formula:

$$P_k^{(1)}(T) = \sum_{i=1}^{T/\Delta\tau} \Delta P(H_1^i)\, p(T/\tau_i) + \sum_{j=1}^{T/\Delta\tau} \Delta P(H_2^j)\, q(T/\tau_j), \qquad (1)$$
where

ΔP(H_1^i) = probability of failure of not less than one component during interval i [probability of hypothesis a)]

ΔP(H_2^j) = probability of failure of not less than two components during interval j [probability of hypothesis b)]

p(T/τ_i) = conditional probability of failure of the reserve component during time T, obtained on the supposition that the first component of the operating system failed during interval i⁵

q(T/τ_j) = conditional probability of successful operation of the reserve component during time T, obtained on the supposition that the reserve component was placed in service during interval j.

⁴ M. A. Sinitsa, "Reservation methods of radioequipment," Elektrosviaz, no. 7; 1958.

⁵ Note that the notation p(T/τ_i) apparently does not mean p(x), where x is the quotient T/τ_i, but is rather a way of indicating functional dependence on both T and τ_i, as though written p(T|τ_i), p(T; τ_i), or p(T, τ_i).


Increasing the number of intervals to infinity (w → ∞) and passing to the limit, we obtain

$$P_k^{(1)}(T) = \int_0^{T} dP(H_1)\, p(T/\tau) + \int_0^{T} dP(H_2)\, q(T/\tau). \qquad (2)$$


Let us consider the expressions under the integrals. The differential of the probability of hypothesis a) is the probability that failure will occur during the interval dτ at τ:

$$dP(H_1) = r_k^{(1)}(\tau)\, d\tau, \qquad (3)$$

where r_k^{(1)}(τ) is the probability density of failure in the system of not less than one operating component out of the total number of k components.

As shown elsewhere by the author,⁶ r_k^{(1)}(t) can be expressed in terms of the probability density of component failure f(t) by the formula

$$r_k^{(1)}(t) = k f(t)\left[\int_t^{\infty} f(\tau)\, d\tau\right]^{k-1}.$$


Similarly,

$$dP(H_2) = r_k^{(2)}(\tau)\, d\tau, \qquad (4)$$

where r_k^{(2)}(τ) is the probability density of failure of the system, assuming that not less than two components have failed. r_k^{(2)}(t) can also be expressed through f(t):

$$r_k^{(2)}(t) = k(k-1)\, f(t)\left[\int_t^{\infty} f(\tau)\, d\tau\right]^{k-2} \int_0^{t} f(\tau)\, d\tau.$$
t O 


The conditional probability of failure of the re- 
serve component, and the conditional probability 


⁶ M. A. Sinitsa, "Reservation of radioelectronic equipment," Radioelektronnaya Promushlennoct, no. 5; 1958.


SINITSA: ON RESERVATION BY REPLACEMENT METHOD 7


of its successful operation, are determined by the equations

$$p(T/\tau) = \int_0^{T} f(\tau, t)\, dt, \qquad q(T/\tau) = \int_T^{\infty} f(\tau, t)\, dt. \qquad (5)$$

In these equations the integrand has different expressions in different time intervals:

$$f(\tau, t) = \begin{cases} f'(t), & t < \tau, \\ f''(t), & t \ge \tau. \end{cases} \qquad (6)$$


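Taking into account (3), (4), and (5), expression (2) becomes

$$P_k^{(1)}(T) = \int_0^{T} r_k^{(1)}(\tau)\left[\int_0^{T} f(\tau, t)\, dt\right] d\tau + \int_0^{T} r_k^{(2)}(\tau)\left[\int_T^{\infty} f(\tau, t)\, dt\right] d\tau. \qquad (7)$$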


Formula (7) allows us to evaluate the probability of failure of a system consisting of k operating components and one reserve component, which is placed in service after failure of an operating component. It is the most general formula for a system with one reserve component, in which failure of the reserve may be either a statistically dependent or an independent event. For example, the probability of failure of a system with a hot (statistically independent) reserve component can be obtained as a particular case of (7). Indeed, since f(τ, t) = f(t), the integrands in both terms of (7) decompose into products of functions of independent variables:

$$P_k^{(1)}(T) = \int_0^{T} r_k^{(1)}(\tau)\, d\tau \int_0^{T} f(t)\, dt + \int_0^{T} r_k^{(2)}(\tau)\, d\tau \int_T^{\infty} f(t)\, dt.$$


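Considering that

$$\int_0^{T} r_k^{(j)}(\tau)\, d\tau = \sum_{i=j}^{k} C_k^{i}\left[\int_0^{T} f(\tau)\, d\tau\right]^{i}\left[\int_T^{\infty} f(\tau)\, d\tau\right]^{k-i}, \qquad \int_0^{T} f(\tau)\, d\tau = p, \qquad \int_T^{\infty} f(\tau)\, d\tau = q, \qquad (8)$$

the last expression becomes

$$P_k^{(1)}(T) = \left(C_k^{1} p q^{k-1} + C_k^{2} p^{2} q^{k-2} + \cdots + p^{k}\right) p + \left(C_k^{2} p^{2} q^{k-2} + C_k^{3} p^{3} q^{k-3} + \cdots + p^{k}\right) q.$$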




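After rather simple transformations we obtain

$$P_k^{(1)}(T) = \left(C_k^{1} + C_k^{2}\right) p^{2} q^{k-1} + \left(C_k^{2} + C_k^{3}\right) p^{3} q^{k-2} + \cdots + \left(C_k^{k-1} + C_k^{k}\right) p^{k} q + p^{k+1}.$$

Using the property of binomial coefficients, we can write

$$C_k^{1} + C_k^{2} = C_{k+1}^{2}, \quad C_k^{2} + C_k^{3} = C_{k+1}^{3}, \quad \ldots, \quad C_k^{k-1} + C_k^{k} = C_{k+1}^{k};$$

then

$$P_k^{(1)}(T) = C_{k+1}^{2} p^{2} q^{k-1} + C_{k+1}^{3} p^{3} q^{k-2} + \cdots + C_{k+1}^{k+1} p^{k+1},$$

or

$$P_k^{(1)}(T) = \sum_{i=2}^{k+1} C_{k+1}^{i} p^{i} q^{k+1-i}. \qquad (9)$$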

Thus, we have obtained an expression that agrees completely with the formula for the probability of failure of a system with one hot reserve component, described previously.⁶
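The algebra leading to (9) can be verified numerically for small k. The sketch compares three quantities: the hot-reserve form of (7) with (8) substituted, the sum (9), and the probability that at least two of k + 1 independent components fail, 1 - q^(k+1) - (k+1)pq^k. The last is what one hot reserve among k operating components amounts to. The values of k and p are arbitrary.

```python
import math

def failure_before_simplification(k, p):
    """Hot-reserve case of (7) with (8) substituted:
    (C_k^1 p q^(k-1) + ... + p^k) p + (C_k^2 p^2 q^(k-2) + ... + p^k) q."""
    q = 1.0 - p
    at_least_one = sum(math.comb(k, i) * p ** i * q ** (k - i) for i in range(1, k + 1))
    at_least_two = sum(math.comb(k, i) * p ** i * q ** (k - i) for i in range(2, k + 1))
    return at_least_one * p + at_least_two * q

def failure_formula_9(k, p):
    """(9): sum over i = 2..k+1 of C_(k+1)^i p^i q^(k+1-i)."""
    q = 1.0 - p
    return sum(math.comb(k + 1, i) * p ** i * q ** (k + 1 - i) for i in range(2, k + 2))

for k in (1, 3, 10):
    print(k, failure_before_simplification(k, 0.05), failure_formula_9(k, 0.05))
```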

Let us consider the solution of the same problem for the case of two reserve components in the system, all other conditions being retained.

For convenience, a system without reserve components will be referred to as the "basic system." Failure of the system during time T is the sum of three mutually exclusive events:


1) Failure in the basic system of not less than 
1 operating component and 2 reserve ones which 
have been placed in service in succession. 

2) Failure in the basic system of not less than 
2 operating components and a reserve one, with 
the other reserve component operating properly. 

3) Failure in the basic system of not less than 
3 operating components with 2 reserve ones oper- 
ating properly. 


The hypotheses of these three items are:

a) the hypothesis of the first item, H_1, is failure of the basic system and of the first reserve component within an elementary time interval;

b) the hypothesis of the second item, H_2, is failure in the basic system of not less than two components and of a reserve one within an elementary time interval;

c) the hypothesis of the third item, H_3, is failure in the basic system of not less than three components within an elementary time interval, with the reserve components operating properly.




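The probabilities of these hypotheses are:

$$dP(H_1) = r_k^{(1)}(\tau)\left[\int_0^{T} f(\tau, t)\, dt\right] d\tau,$$

$$dP(H_2) = r_k^{(2)}(\tau)\left[\int_0^{T} f(\tau, t)\, dt\right] d\tau, \qquad (10)$$

$$dP(H_3) = r_k^{(3)}(\tau)\left[\int_T^{\infty} f(\tau, t)\, dt\right] d\tau.$$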


The probability densities r_k^{(2)}(t) and r_k^{(3)}(t) can be evaluated, as shown by the author,⁶ from the function f(t):

$$r_k^{(i)}(t) = \frac{k!}{(i-1)!\,(k-i)!}\, f(t)\left[\int_0^{t} f(\tau)\, d\tau\right]^{i-1}\left[\int_t^{\infty} f(\tau)\, d\tau\right]^{k-i}. \qquad (10a)$$
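Formula (10a) is the density of the instant of the ith failure among k components, so its integral from 0 to T should equal the probability that at least i of the k components have failed by T, which is the form used in (8). The sketch below checks this for hypothetical exponential components (rate and T chosen arbitrarily), using ∫_0^t f = F(t) and ∫_t^∞ f = 1 - F(t).

```python
import math

def r_density(i, k, f, F, t):
    """(10a) with integral_0^t f = F(t) and integral_t^inf f = 1 - F(t)."""
    coeff = math.factorial(k) / (math.factorial(i - 1) * math.factorial(k - i))
    return coeff * f(t) * F(t) ** (i - 1) * (1.0 - F(t)) ** (k - i)

def integrate_r(i, k, f, F, T, steps=20000):
    """Trapezoid-rule value of integral_0^T r_k^(i)(t) dt."""
    h = T / steps
    s = 0.5 * (r_density(i, k, f, F, 0.0) + r_density(i, k, f, F, T))
    s += sum(r_density(i, k, f, F, j * h) for j in range(1, steps))
    return s * h

def at_least(i, k, p):
    """Probability that at least i of k independent components have failed."""
    return sum(math.comb(k, j) * p ** j * (1.0 - p) ** (k - j) for j in range(i, k + 1))

# Hypothetical exponential components.
lam, T, k = 1e-3, 500.0, 5

def f(t):
    return lam * math.exp(-lam * t)

def F(t):
    return 1.0 - math.exp(-lam * t)

p = F(T)
for i in (1, 2, 3):
    print(i, integrate_r(i, k, f, F, T), at_least(i, k, p))
```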


In the same way, the probability of system failure with two reserve components can be determined from the total probability formula:

$$P_k^{(2)}(T) = \int_0^{T} dP(H_1)\, p(T/\xi) + 2\int_0^{T} dP(H_2)\, q(T/\lambda) + \int_0^{T} dP(H_3)\, q(T/\theta), \qquad (11)$$


where

p(T/ξ) = conditional probability of failure of the second reserve component, obtained on the supposition that the second reserve component was placed in service at the moment ξ > τ, that is, when the second operating component or the first reserve one failed;

q(T/λ) = conditional probability of proper operation of the reserve component, obtained on the supposition that the reserve component was placed in service at the moment λ > τ;

q(T/θ) = conditional probability of proper operation of the reserve component, obtained on the supposition that the reserve component was placed in service at the moment θ.


Taking into account that

$$p(T/\xi) = \int_0^{T} f(\xi, t)\, dt, \qquad q(T/\lambda) = \int_T^{\infty} f(\lambda, t)\, dt, \qquad q(T/\theta) = \int_T^{\infty} f(\theta, t)\, dt,$$


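and making allowance for (10), formula (11) becomes

$$P_k^{(2)}(T) = \int_0^{T} r_k^{(1)}(\tau)\, d\tau \int_0^{T} f(\tau, \xi)\, d\xi \int_0^{T} f(\xi, t)\, dt + 2\int_0^{T} r_k^{(2)}(\tau)\left\{\int_0^{T} f(\tau, t)\, dt\left[\int_T^{\infty} f(\lambda, t)\, dt\right]\right\} d\tau + \int_0^{T} r_k^{(3)}(\tau)\left[\int_T^{\infty} f(\tau, t)\, dt\right]\left[\int_T^{\infty} f(\theta, t)\, dt\right] d\tau. \qquad (12)$$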
Here again, the conditional probabilities are evaluated for two time intervals. In this case the probability densities are

$$f(\xi, t) = \begin{cases} f'(t), & t < \xi, \\ f''(t), & t > \xi, \end{cases} \qquad f(\lambda, t) = \begin{cases} f'(t), & t < \lambda, \\ f''(t), & t > \lambda. \end{cases}$$


Eq. (12) is given in its most general form. To make it applicable in practice, we should specify the order of integration. The first term on the right-hand side of the equation contains the probabilities of two groups of events, for which the order of integration differs. The first group of events is the failure of exactly one component in the operating system and of each of the reserve ones placed in service in succession. The second group of events is the failure of not less than two operating components and of the reserve ones that replaced each failed component. In view of the above, (12) becomes


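$$\begin{aligned}
P_k^{(2)}(T)={}&\int_0^T r_k^{(1)}(\tau)\Big\{\int_0^T f(\tau,\xi)\Big[\int_0^T f(\xi,t)\,dt\Big]d\xi\Big\}d\tau+\int_0^T r_k^{(2)}(\tau)\Big[\int_0^T f(\tau,\xi)\,d\xi\Big]^2 d\tau\\
&+2\int_0^T r_k^{(2)}(\tau)\Big\{\int_0^T f(\tau,t)\,dt\Big[\int_T^\infty f(\tau,t)\,dt\Big]\Big\}d\tau+\int_0^T r_k^{(3)}(\tau)\Big[\int_T^\infty f(\tau,t)\,dt\Big]^2 d\tau. \qquad (13)
\end{aligned}$$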


Here $r_k^{(1)}(\tau)$ is the probability density of failure of exactly one component out of K operating ones, and $r_k^{(i)}(\tau)$ is determined according to the formula obtained elsewhere by the author°:

$$r_k^{(i)}(t)=\frac{k!}{(i-1)!\,(k-i)!}\,f(t)\Big[\int_0^t f(\tau)\,d\tau\Big]^{i-1}\Big[1-\int_0^t f(\tau)\,d\tau\Big]^{k-i}.$$
Now let us follow how (13) changes as the environment of the reserve is changed. If the reserve components do not wear out before being placed in service, then

$$f(\tau,\xi)=\begin{cases}0, & \xi<\tau,\\ f(\xi-\tau), & \xi>\tau,\end{cases}\qquad\qquad f(\xi,\lambda)=\begin{cases}0, & \lambda<\xi,\\ f(\lambda-\xi), & \lambda>\xi,\end{cases}\qquad (14)$$

and (13) becomes


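$$\begin{aligned}
P_k^{(2)}(T)={}&\int_0^T r_k^{(1)}(\tau)\Big\{\int_0^{T-\tau}f(\xi)\Big[\int_0^{T-\tau-\xi}f(x)\,dx\Big]d\xi\Big\}d\tau+\int_0^T r_k^{(2)}(\tau)\Big[\int_0^{T-\tau}f(\xi)\,d\xi\Big]^2 d\tau\\
&+2\int_0^T r_k^{(2)}(\tau)\Big\{\int_0^{T-\tau}f(t)\,dt\Big[\int_{T-\tau}^\infty f(t)\,dt\Big]\Big\}d\tau+\int_0^T r_k^{(3)}(\tau)\Big[\int_{T-\tau}^\infty f(t)\,dt\Big]^2 d\tau. \qquad (15)
\end{aligned}$$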

If the reserve components are hot, that is, $f(\xi,t) = f(t)$, $f(\tau,\xi) = f(\xi)$, ..., then the expressions under the integrals in (13) decompose into products of functions of independent variables:


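$$\begin{aligned}
P_k^{(2)}(T)={}&\int_0^T r_k^{(1)}(\tau)\,d\tau\int_0^T f(\xi)\,d\xi\int_0^T f(t)\,dt+\int_0^T r_k^{(2)}(\tau)\,d\tau\Big[\int_0^T f(\xi)\,d\xi\Big]^2\\
&+2\int_0^T r_k^{(2)}(\tau)\,d\tau\int_0^T f(t)\,dt\Big[\int_T^\infty f(t)\,dt\Big]+\int_0^T r_k^{(3)}(\tau)\,d\tau\Big[\int_T^\infty f(t)\,dt\Big]^2. \qquad (16)
\end{aligned}$$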




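In this case

$$\int_0^T r_k^{(1)}(\tau)\,d\tau=k\,p\,q^{k-1},\qquad\int_0^T r_k^{(2)}(\tau)\,d\tau=\sum_{i=2}^{k}C_k^i\,p^iq^{k-i},\qquad\int_0^T r_k^{(3)}(\tau)\,d\tau=\sum_{i=3}^{k}C_k^i\,p^iq^{k-i},$$

where $p=\int_0^T f(t)\,dt$ is the probability of failure of a component during time T and $q=1-p=\int_T^\infty f(t)\,dt$; therefore formula (16) becomes: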


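$$\begin{aligned}
P_k^{(2)}(T)={}&\big(C_k^1pq^{k-1}+C_k^2p^2q^{k-2}+\cdots+C_k^kp^k\big)p^2\\
&+2\big(C_k^2p^2q^{k-2}+C_k^3p^3q^{k-3}+\cdots+C_k^kp^k\big)pq\\
&+\big(C_k^3p^3q^{k-3}+\cdots+C_k^kp^k\big)q^2.
\end{aligned}$$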


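After some transformations we obtain

$$P_k^{(2)}(T)=\big(C_k^1+2C_k^2+C_k^3\big)p^3q^{k-1}+\big(C_k^2+2C_k^3+C_k^4\big)p^4q^{k-2}+\cdots+\big(C_k^{k-1}+2C_k^k\big)p^{k+1}q+C_k^kp^{k+2}.$$

In view of the property of binomial coefficients, $C_k^i+2C_k^{i+1}+C_k^{i+2}=C_{k+2}^{i+2}$, we obtain

$$C_k^1+2C_k^2+C_k^3=C_{k+2}^3,\qquad C_k^2+2C_k^3+C_k^4=C_{k+2}^4,\qquad\ldots,\qquad C_k^{k-1}+2C_k^k=C_{k+2}^{k+1}.$$

Therefore

$$P_k^{(2)}(T)=C_{k+2}^3p^3q^{k-1}+C_{k+2}^4p^4q^{k-2}+\cdots+C_{k+2}^{k+2}p^{k+2}. \qquad (17)$$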


This is exactly the probability of failure of a system consisting of K operating components and two reserve ones in the case of hot reserve. Thus (13) is general: it evaluates the reliability of a system consisting of K operating components and two reserve ones regardless of the environment of the reserve components.
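The collapse of the grouped expression (16) into (17) can be checked numerically. The Python sketch below assumes hot reserve and equal component reliability q (so that the integral of f over (0, T) is p and over (T, ∞) is q), assembles the four terms of (16) from the binomial integrals given above, and compares them with (17), i.e. the probability that at least three of the k + 2 components fail. The names `grouped_16`, `closed_form_17` and `identity_holds` are illustrative, not taken from the paper.

```python
from math import comb, isclose


def grouped_16(k: int, q: float) -> float:
    """Hot-reserve failure probability assembled from the four terms of (16)."""
    p = 1.0 - q
    s1 = k * p * q ** (k - 1)  # integral of r_k^(1): exactly one of k fails
    s2 = sum(comb(k, i) * p**i * q ** (k - i) for i in range(2, k + 1))  # at least two fail
    s3 = sum(comb(k, i) * p**i * q ** (k - i) for i in range(3, k + 1))  # at least three fail
    # s1 and s2 with both reserves failed; s2 with exactly one reserve failed
    # (2 ways); s3 with no reserve failed
    return s1 * p**2 + s2 * p**2 + 2 * s2 * p * q + s3 * q**2


def closed_form_17(k: int, q: float) -> float:
    """Formula (17): at least three failures among the k + 2 hot components."""
    p = 1.0 - q
    return sum(comb(k + 2, j) * p**j * q ** (k + 2 - j) for j in range(3, k + 3))


def identity_holds(k: int) -> bool:
    """Check C_k^i + 2 C_k^(i+1) + C_k^(i+2) == C_(k+2)^(i+2) for i = 1..k."""
    return all(
        comb(k, i) + 2 * comb(k, i + 1) + comb(k, i + 2) == comb(k + 2, i + 2)
        for i in range(1, k + 1)
    )


for k in range(1, 9):
    assert identity_holds(k)
    for q in (0.5, 0.8, 0.95):
        assert isclose(grouped_16(k, q), closed_form_17(k, q), rel_tol=1e-12)
```

Because `math.comb` returns 0 when the lower index exceeds the upper one, the identity check also covers the last coefficients $C_k^{k-1}+2C_k^k$ and $C_k^k$ without special cases.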

The above method can be applied to a system with three reserve components. The formula for determining the probability of system failure during time T should allow for the following groups of events:


1) Failure of exactly one component in the basic 
system and all the reserve ones which have been 
placed in service in succession instead of the 
failed component. 

2) Failure of exactly two components in the 
basic system and all the reserve ones; one of the 
reserve components replacing the first failed com- 
ponent and the other two reserve components re- 
placing the second failed component in succession. 

3) Failure of not less than three components in the basic system and all the reserve ones, each of the reserve components replacing one of the operating components in succession.

4) Failure of not less than two components in 
the basic system and two reserve ones with one 
reserve component operating properly. 

5) Failure of not less than three components in 
the basic system and a reserve one, the other two 
reserve components operating properly. 

6) Failure of not less than four components in 
the basic system, all the reserve ones operating 
properly. 


Such a plan of events allows us to determine the order of integration in the formula for the probability of failure of a system having three reserve components:


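$$\begin{aligned}
P_k^{(3)}(T)={}&\int_0^T r_k^{(1)}(\tau)\,d\tau\int_0^T f(\tau,\xi)\,d\xi\int_0^T f(\xi,\lambda)\,d\lambda\int_0^T f(\lambda,\theta)\,d\theta+\cdots\\
&+\int_0^T r_k^{(3)}(\tau)\Big[\int_0^T f(\tau,t)\,dt\Big]^3 d\tau+3\int_0^T r_k^{(2)}(\tau)\Big[\int_0^T f(\tau,t)\,dt\Big]^2\Big[\int_T^\infty f(\tau,t)\,dt\Big]d\tau\\
&+3\int_0^T r_k^{(3)}(\tau)\Big[\int_0^T f(\tau,t)\,dt\Big]\Big[\int_T^\infty f(\tau,t)\,dt\Big]^2 d\tau+\int_0^T r_k^{(4)}(\tau)\Big[\int_T^\infty f(\tau,t)\,dt\Big]^3 d\tau. \qquad (18)
\end{aligned}$$

Here the functions $r_k^{(i)}(\tau)$ and $f(\cdot,\cdot)$ are determined according to formulas (14) and (10a).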
If we apply the above method to the case of hot reserve, (18) becomes:

$$P_k^{(3)}(T)=C_{k+3}^4p^4q^{k-1}+C_{k+3}^5p^5q^{k-2}+\cdots+C_{k+3}^{k+3}p^{k+3}. \qquad (19)$$

It is a particular case of the formula for determining the probability of failure of a system with hot reserve.

Using the same plan one can work out a formula for determining the probability of failure of a system consisting of K operating components and m reserve ones. This formula is very complicated, so it is not given here. But we shall point out the basic groups of events that the formula should allow for:


1) Failure of not less than one of the total num- 
ber of components in the basic system and all the 
reserve ones. 

2) Failure of not less than two components out 
of K ones in the basic system and m-1 reserve 
ones with one reserve component operating prop- 
erly. 

3) Failure of not less than three components out of K ones and m - 2 reserve ones with two reserve components operating properly.

. . .

m + 1) Failure of not less than m + 1 components out of K ones with all m reserve components operating properly.


Each of these groups, except the last, should be divided into subgroups, as was done for m = 2 and m = 3. This subdivision makes it possible to determine the order of integration in the formula.

While solving the problem of determining the probability of system failure, we considered particular cases of a system containing K operating components and 1, 2, 3, ..., m reserve ones. In these cases we did not set any limits on m: the number of reserve components can be either greater or smaller than the number of operating components.

Thus, the resulting formulas for evaluating the reliability of complex systems consisting of K operating components and m reserve ones are the most general ones, as they allow us to estimate the reliability of systems with hot and cold reserve kept in any environment. The main formulas** can be obtained as particular cases of these general expressions.

At present in radio relay systems one or two 
reserve circuits per several operating ones are 
used. The same method of increasing reliability 
can be applied to a number of other radio- 
engineering systems. It would be of interest to 
consider the effectiveness of such a method of in- 
creasing reliability, and when it should be used. 
For this purpose let us consider the following 
relationship 




$$W=W(q),$$

where q = reliability of the operating and reserve components, and W = reliability gain.

In general, the reliability gain can be determined as the ratio of the probability of system failure without reserve to the probability of system failure with reserve:

$$W=\frac{P(k)}{P(k,m)}.$$


It is assumed that the system consists of K operating components and m reserve ones. The probability of failure of the system without reserve, consisting of K components, can be determined by the equation

$$P(k)=1-\prod_{i=1}^{k}q_i,$$

and if all the operating components have equal reliability, the equation becomes

$$P(k)=1-q^k.$$

The probability of system failure with reserve 
P(k,m) is determined by various formulas depend- 
ing upon the environment of the reserve component. 

In the case of cold reserve, P(k,m) can be evaluated by (7) when m = 1 and by (12) when m = 2. In the case of hot reserve the probability of failure is evaluated by the more general formula

$$P(k,m)=\sum_{i=1}^{k}C_{k+m}^{m+i}\,p^{m+i}q^{k-i}. \qquad (20)$$

Here p = 1 - q. Thus, when $q_1 = q_2 = \dots = q$ and the reserve is hot, we obtain:

$$W=\frac{1-q^k}{\displaystyle\sum_{i=1}^{k}C_{k+m}^{m+i}\,p^{m+i}q^{k-i}}. \qquad (21)$$


With the aid of this formula the curves W = W(q) given in Figs. 1-3 are obtained. Analyzing these curves, one can see that the use of one reserve component per several operating ones (m = 1) gives a substantial gain in reliability only when q > 0.7-0.8 (see Fig. 1). This gain rises sharply as q increases. For example, when k = 5 and q = 0.8, W = 2, and when q = 0.95, W = 7. Naturally, when there are fewer operating components this gain in reliability will be still greater. Thus, when k = 2 and q = 0.95, W = 13.35.

The term "multiple reservation" in this case means the ratio m/k. Here again, as in the case K = 1, an increase of multiple reservation reduces the probability of system failure, that is, W increases. However, it should be pointed out that the gain in reliability rises substantially with the increase of m even when the multiple reservation m/k is kept constant. For example, when q = 0.8 and k = 6 we find from the diagrams (see Figs. 1, 2, and 3) that W = 1.75 when m = 1, W = 3.7 when m = 2, and W = 9 when m = 3. For higher reliability of the operating and reserve circuits, q > 0.8, the increase of W is still sharper. This is explained by the fact that each reserve component can replace any failed one out of the total K components, and even reserve ones which have been previously placed in service.

Fig. 1—Reliability gain. One "hot" reserve component.

Fig. 2—Reliability gain. Two "hot" reserve components.

Fig. 3—Reliability gain. Three "hot" reserve components.
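Formulas (20) and (21) are easy to evaluate directly. The Python sketch below assumes hot reserve and equal reliability q for all operating and reserve components; `p_fail_hot` and `gain_hot` are hypothetical names. For the cases quoted above it gives W ≈ 1.95 and 6.90 for k = 5, m = 1 with q = 0.8 and 0.95, and W ≈ 1.74, 3.63 and 8.62 for k = 6, q = 0.8 with m = 1, 2, 3, close to the values read from the curves.

```python
from math import comb


def p_fail_hot(k: int, m: int, q: float) -> float:
    """Formula (20): failure probability, k operating and m hot reserve components."""
    p = 1.0 - q
    return sum(comb(k + m, m + i) * p ** (m + i) * q ** (k - i) for i in range(1, k + 1))


def gain_hot(k: int, m: int, q: float) -> float:
    """Formula (21): W = P(k) / P(k, m), with P(k) = 1 - q**k for the unreserved system."""
    return (1.0 - q**k) / p_fail_hot(k, m, q)


for k, m, q in [(5, 1, 0.8), (5, 1, 0.95), (6, 1, 0.8), (6, 2, 0.8), (6, 3, 0.8)]:
    print(f"k={k}, m={m}, q={q}: W = {gain_hot(k, m, q):.2f}")
```

Summing from i = 1 to k in (20) is the same as summing the binomial probabilities of m + 1 or more failures among the k + m components, which is why (17) and (19) are its special cases for m = 2 and m = 3.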

Now let us estimate what advantages we gain by using cold reserve. For this purpose let us use the functions $P_{\text{cold}} = \varphi_1(k)$, $P_{\text{hot}} = \varphi_2(k)$, $W_{\text{cold}} = \psi_1(k)$, $W_{\text{hot}} = \psi_2(k)$ for q = const and m = 1. The functions $P_{\text{cold}} = \varphi_1(k)$, $P_{\text{hot}} = \varphi_2(k)$ and $W_{\text{hot}} = \psi_2(k)$ are determined according to (7), (20) and (21), respectively, and the function $W_{\text{cold}} = \psi_1(k)$ according to the formula

$$W_{\text{cold}}=\frac{P(k)}{P_{\text{cold}}(k,m)}, \qquad (22)$$

where $P_{\text{cold}}(k,m)$ is also determined according to (7).

The curves plotted from these formulas are given in Fig. 4. For comparison, Fig. 4 also gives a curve of the probability of system failure without reserve, P(k), and one of the probability of system failure for m = 2 with hot reserve. The diagram is plotted for the uniform law of component failure, f(t) = const, with the probability of failure of each component during time T equal to p = 0.1. The diagram shows that cold reserve is much preferable to hot reserve when there is one reserve component and few operating ones in the system. As the number of operating components in the system increases to 5-6, this advantage decreases (see Table I).

Fig. 4—Comparison of "hot" and "cold" reserve.

TABLE I


This does not mean, however, that cold reserve should not be used at all. Cold reserve makes it possible to save power-supply energy, and when two or more reserve components per 5-6 operating ones are used, a considerable gain in reliability is obtained.

One can solve the inverse problem with the aid of the diagram in Fig. 4, that is, determine the number of reserve circuits necessary to secure a given reliability, if the number of operating circuits and their reliability are known. For example, suppose the probability of system failure must not exceed 0.09 in a system of 6 operating circuits, each having a reliability of 0.9. Using the diagram (see Fig. 4) we determine that such reliability of the system can be secured with two reserve circuits.
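The inverse problem can also be sketched in code, but only for hot reserve, using (20); the cold-reserve curve of Fig. 4 relies on formula (7), which is not reproduced here. Assuming hot reserve, k = 6 and q = 0.9, the hypothetical helper `min_hot_reserve` below searches for the smallest m with P(k, m) ≤ 0.09.

```python
from math import comb
from typing import Optional


def p_fail_hot(k: int, m: int, q: float) -> float:
    """Formula (20): failure probability, k operating and m hot reserve components."""
    p = 1.0 - q
    return sum(comb(k + m, m + i) * p ** (m + i) * q ** (k - i) for i in range(1, k + 1))


def min_hot_reserve(k: int, q: float, target: float, m_max: int = 20) -> Optional[int]:
    """Smallest number m of hot reserve components with P(k, m) <= target."""
    for m in range(0, m_max + 1):  # m = 0 reproduces P(k) = 1 - q**k
        if p_fail_hot(k, m, q) <= target:
            return m
    return None  # target not reachable within m_max reserve components


print(min_hot_reserve(k=6, q=0.9, target=0.09))
```

With these numbers one hot reserve circuit gives P ≈ 0.150 and two give P ≈ 0.038, so the search returns m = 2, in line with the answer read from Fig. 4.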

Using the same principle it is possible to plot diagrams for various values of p and m by plotting curves for $P_{\text{hot}}(m)$ and $P_{\text{cold}}(m)$. These diagrams make it possible to solve the above problem for systems with various values of p and k. With the increase of m the significance of cold reserve also increases, and in some cases its use will allow us to decrease the number of reserve circuits while maintaining a given reliability. Besides, it is considered expedient to use both hot and cold reserve when m > 2.

The problem of determining system reliability with cold reserve has been solved on the supposition that the probabilities of failure of all the operating and reserve components are equal. This is the most common case in practice, since reserve components intended to carry out the same functions as the operating ones consist of the same elements. In some cases, however, operating and reserve components can differ greatly from each other in reliability; for example, a component may or may not include electric vacuum devices. In this case the method of determining reliability remains the same, but the formulas will be more complicated, as each term preceded by a factor of the type $C_k^i$ must be represented as a sum of non-equivalent terms.

The method of evaluating the reliability of complex systems consisting of several operating and reserve components given in this report makes it possible to estimate the reliability both of existing systems and of systems being designed and developed, since the total reliability of a system is expressed through the laws of failure probability of its components, which can be obtained beforehand. By changing the integration limits, the system reliability during any operating period T can be evaluated. Moreover, using this method one can estimate the number of reserve components or circuits necessary to secure a given reliability of the system.

The reliability estimation of a system with 
many operating and reserve components was given 
without accounting for the reliability effect of 
switching devices. The effect of switching devices 
on total system reliability can be easily taken into 
account for this case.®° The accuracy of the evalu- 
ation of system reliability will be completely de- 
pendent upon the accuracy of initial data: the laws 
of probabilities of failure of components for given 
operating conditions. 




SOME RESULTS OF MATHEMATICAL RELIABILITY THEORY * 


B. R. Levin†


INTRODUCTION 


The problem of reliability of electronic systems (i.e., their ability to fulfill specified functions under specified conditions) has two aspects: experimental and theoretical. The first concerns the statistical analysis of data obtained in the course of operational use, while the second concerns mathematical analysis. Mathematical reliability theory is based on the statistical definition of component (or system) reliability as the probability of satisfactory operation during a given period of time.

Mathematical reliability theory provides the engineer with a rational method of predicting system reliability from a statistical analysis of experimental component reliability data.

Provision of these data is a necessary condi- 
tion for practical application of the mathematical 
reliability theory. 

Some results of the mathematical reliability 
theory obtained by the author are described in 
this paper. 


RELIABILITY ANALYSIS FOR A 
FIXED FRACTION OF TIME 


Consider a system as a matrix composed of m × n independent components. Each component is denoted by $A_{ij}$, where the indexes i and j are the numbers of the row and column of the matrix, respectively. For a fixed interval of time T, component $A_{ij}$ is characterized by its reliability $p_{ij}$. Thus, there is a reliability matrix $\|p_{ij}\|$ corresponding to the component matrix.

In each row the components are serially con- 
nected (whenever one of the operating components 
fails, the whole row breaks down). The rows are 
connected in parallel; thus the system breaks 
down only in case of failure of all the rows. 


*This paper was presented at the Fifth Natl. Symp. 
on Reliability and Quality Control, Philadelphia, Pa., 
January 12-14, 1959. 

† USSR.


Redundancy in a system with a corresponding matrix of m × n rows and columns can be provided by splitting the matrix into r ≤ n serially connected submatrices of $m_k \times s_k$ rows and columns (k = 1, 2, ..., r). This means that

$$\sum_{k=1}^{r}s_k=n.$$

Reliability of the system will be given by

$$P=\prod_{k=1}^{r}\Big\{1-\prod_{i=1}^{m_k}\Big(1-\prod_{j=S_{k-1}+1}^{S_k}p_{ij}\Big)\Big\}, \qquad (1)$$

where

$$S_k=\sum_{l=1}^{k}s_l \qquad (S_0=0).$$


If all the components have equal reliability ($p_{ij} = p$) and the matrix is split equally into r submatrices, each of m rows and n/r columns, then (1) becomes

$$P=\big[1-(1-p^{n/r})^{m}\big]^{r}. \qquad (2)$$

The two extreme methods of substitution are the system standby (r = 1) and the element standby (r = n).

If the reliability P is specified, the required reliability of each component can be computed from (2):

$$p=\big[1-(1-P^{1/r})^{1/m}\big]^{r/n}. \qquad (3)$$

It is seen from (3) that when the required reliability of the system is close to 1, the components must be considerably more reliable in the case of the subsystem substitution scheme than in the case of the component substitution scheme; for m = 2 the permissible component failure probability is about √n times smaller.
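The two extreme schemes can be compared with a short Python sketch of (2) and (3). It assumes that n/r is an integer and uses the illustrative values n = 4, m = 2, P = 0.99; the function names are hypothetical.

```python
def system_reliability(p: float, m: int, n: int, r: int) -> float:
    """Formula (2): m rows by n columns, split into r series-connected submatrices."""
    if n % r:
        raise ValueError("this sketch assumes n/r is an integer")
    return (1.0 - (1.0 - p ** (n // r)) ** m) ** r


def required_component_reliability(P: float, m: int, n: int, r: int) -> float:
    """Formula (3): component reliability p needed for a system reliability P."""
    return (1.0 - (1.0 - P ** (1.0 / r)) ** (1.0 / m)) ** (r / n)


n, m, P = 4, 2, 0.99
p_sys = required_component_reliability(P, m, n, r=1)   # system standby
p_elem = required_component_reliability(P, m, n, r=n)  # element standby
ratio = (1.0 - p_elem) / (1.0 - p_sys)  # ratio of permissible failure probabilities
print(f"system standby: p = {p_sys:.4f}; element standby: p = {p_elem:.4f}; ratio = {ratio:.2f}")
```

For these values system standby requires p ≈ 0.974, while element standby tolerates p ≈ 0.950; the permissible failure probabilities differ by a factor of about 1.93, close to √4 = 2.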

It is assumed that every submatrix has a failure detector and that, whenever one of the operating components fails, the failure is detected and a spare subsystem of components is inserted. Let $p_{dk}$ be the reliability of the detector and switching unit of the k-th submatrix; then the system reliability equation becomes

$$P=\prod_{k=1}^{r}\Big\{1-\Big(1-\prod_{j=S_{k-1}+1}^{S_k}p_{1j}\Big)\prod_{i=2}^{m_k}\Big(1-p_{dk}\prod_{j=S_{k-1}+1}^{S_k}p_{ij}\Big)\Big\}. \qquad (4)$$

Let the reliability $p_{ij} = p$ be equal for all the components, and let the reliability of the switching units be equal as well, $p_{dk} = p_d$. The number of operating components in every group (submatrix) is n/r, and the number of spare rows of components is m - 1. Then (4) becomes

$$P=\Big\{1-\big(1-p^{n/r}\big)\big(1-p_d\,p^{n/r}\big)^{m-1}\Big\}^{r}. \qquad (5)$$

Comparison of the two redundancy methods (splitting the system into r groups and into r′ < r groups) shows that the number of switching elements in the first case is greater than in the second. However, system reliability in the first case is always higher. A special case of this statement may be formulated as follows: whatever the reliability of the failure detector may be, for any number of operating and spare units the method of element standby is much more effective than that of system standby.
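Formula (5) can be explored the same way. The sketch below (hypothetical names, illustrative values n = 10, m = 2, p = 0.9) evaluates system standby (r = 1) and element standby (r = n) for several detector and switch reliabilities p_d; with p_d = 1 it reduces to (2).

```python
def reliability_with_switching(p: float, pd: float, m: int, n: int, r: int) -> float:
    """Formula (5): p = component reliability, pd = detector/switch reliability,
    m rows per submatrix (one operating, m - 1 spare), r submatrices of n/r columns."""
    ps = p ** (n / r)  # reliability of one row of a submatrix
    group = 1.0 - (1.0 - ps) * (1.0 - pd * ps) ** (m - 1)
    return group**r


n, m, p = 10, 2, 0.9
for pd in (0.1, 0.5, 0.9, 1.0):
    system_standby = reliability_with_switching(p, pd, m, n, r=1)
    element_standby = reliability_with_switching(p, pd, m, n, r=n)
    print(f"pd = {pd}: system standby {system_standby:.3f}, element standby {element_standby:.3f}")
```

For these values element standby stays ahead for every p_d tried, although the margin becomes small when p_d is low (about 0.385 against 0.371 at p_d = 0.1).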


RELIABILITY vs TIME ANALYSIS 


Up to this point the time interval has been fixed. If it is changed, the reliability becomes a function of time. The moment at which the first failure of the element takes place can be assumed to be a random value $\xi_{ij}$, and then $1 - p_{ij}(t)$ will be the integral distribution function of this random value. If $p_{ij}(t)$ is continuous, the probability density

$$w_{ij}(t)=-\frac{dp_{ij}(t)}{dt}$$

and the characteristic function

$$\theta_{ij}(v)=\int_0^\infty w_{ij}(t)\,e^{ivt}\,dt$$

can be derived. The k-th order moments of the distribution will be

$$m_k\{\xi_{ij}\}=\int_0^\infty t^k w_{ij}(t)\,dt=k\int_0^\infty t^{k-1}p_{ij}(t)\,dt. \qquad (6)$$

From (6), as a special case, the mean time of failure-free operation is derived,

$$t^*=m_1\{\xi_{ij}\}=\int_0^\infty p_{ij}(t)\,dt, \qquad (7)$$

as well as the variance of the time of satisfactory operation,

$$\sigma^2=m_2-m_1^2=2\int_0^\infty t\,p_{ij}(t)\,dt-\Big(\int_0^\infty p_{ij}(t)\,dt\Big)^2. \qquad (8)$$




Eqs. (7) and (8) may obviously be used for the mean and the variance of the time of failure-free operation of the system if we substitute the system reliability P(t) for $p_{ij}(t)$.
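Formulas (7) and (8) lend themselves to direct numerical integration. The Python sketch below applies the composite Simpson rule and truncates the infinite integrals at an assumed upper limit where p(t) is negligible; the names are hypothetical. It is checked against the exponential law p(t) = exp(-t/t*), whose mean is t* and variance t*², and, as an illustrative non-exponential case, p(t) = exp(-t²), whose mean is √π/2 and variance 1 - π/4.

```python
import math


def simpson(f, a: float, b: float, n: int = 6000) -> float:
    """Composite Simpson rule with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    total += 4.0 * sum(f(a + i * h) for i in range(1, n, 2))
    total += 2.0 * sum(f(a + i * h) for i in range(2, n, 2))
    return total * h / 3.0


def mean_and_variance(p, upper: float):
    """Formulas (7) and (8), with the integrals truncated at `upper` (an assumption)."""
    t_mean = simpson(p, 0.0, upper)
    second_moment = 2.0 * simpson(lambda t: t * p(t), 0.0, upper)  # m_2 from (6)
    return t_mean, second_moment - t_mean**2


t_star = 2.0
mean, var = mean_and_variance(lambda t: math.exp(-t / t_star), upper=60.0)
mean_r, var_r = mean_and_variance(lambda t: math.exp(-t * t), upper=10.0)
print(mean, var, mean_r, var_r)
```

Working from the reliability function rather than the density avoids differentiating p(t), which is convenient when p(t) comes from tabulated test data.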

Given $p_{ij}(t)$, probability theory permits us to determine the reliability of the system $\|A_{ij}\|$ as a function of time for two limiting methods of switching: hot (all components of a given matrix column are switched in simultaneously) and cold (each component is switched in only when the previous one has failed).

For reliability when the spare components are switched in by the hot method, we may obviously use the previously derived equations, replacing the constants $p_{ij}$ by the functions $p_{ij}(t)$.

When the spare components are switched in by the cold method, the reliability $p_{ij}(t, \xi_{i,j-1})$ depends on the random parameter $\xi_{i,j-1}$, the time instant when component $A_{i,j-1}$ fails. Thus we consider only the mean reliability of component $A_{ij}$, that is:

$$\bar p_{ij}(t)=M\{p_{ij}(t,\xi_{i,j-1})\}. \qquad (9)$$

Owing to the independence of the components, the characteristic function of the random parameter $\xi_{i,j-1}$ becomes

$$\theta_{i,j-1}(v)=\prod_{k=1}^{j-1}\theta_{ik}(v). \qquad (10)$$

Then it is easy to show that

$$\bar p_{ij}(t)=\int_0^t W_{i,j-1}(x)\,p_{ij}(t-x)\,dx+\int_t^\infty W_{i,j-1}(x)\,dx, \qquad (11)$$

where $W_{i,j-1}$ is the probability density of $\xi_{i,j-1}$:

$$W_{i,j-1}(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty}\theta_{i,j-1}(v)\,e^{-ivx}\,dv. \qquad (12)$$
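Formula (11) is a convolution and can be evaluated numerically. The sketch below assumes identical exponential components with a hypothetical failure rate λ (`lam`); then $\xi_{i,1}$ is exponential and, by (10), $\xi_{i,2}$ (the sum of two independent exponential lifetimes) has the Erlang density $\lambda^2 x e^{-\lambda x}$. The results can be compared with the known closed forms $e^{-\lambda t}(1+\lambda t)$ and $e^{-\lambda t}(1+\lambda t+(\lambda t)^2/2)$ for one and two cold spares.

```python
import math


def mean_reliability_cold(W, p, W_tail, t: float, n: int = 4000) -> float:
    """Formula (11) with the trapezoidal rule:
    p_bar(t) = int_0^t W(x) p(t - x) dx + int_t^inf W(x) dx."""
    h = t / n
    values = [W(i * h) * p(t - i * h) for i in range(n + 1)]
    convolution = h * (sum(values) - 0.5 * (values[0] + values[-1]))
    return convolution + W_tail(t)


lam, t = 0.5, 3.0  # hypothetical failure rate and time


def p(s):  # reliability of the component being switched in
    return math.exp(-lam * s)


def W1(x):  # density of xi_{i,1}: one exponential lifetime
    return lam * math.exp(-lam * x)


def W1_tail(x):  # integral of W1 from x to infinity
    return math.exp(-lam * x)


def W2(x):  # density of xi_{i,2}: sum of two exponential lifetimes (Erlang)
    return lam**2 * x * math.exp(-lam * x)


def W2_tail(x):  # integral of W2 from x to infinity
    return math.exp(-lam * x) * (1.0 + lam * x)


p2 = mean_reliability_cold(W1, p, W1_tail, t)  # second component in the row
p3 = mean_reliability_cold(W2, p, W2_tail, t)  # third component in the row
print(p2, p3)
```

For exponential components the integrands happen to be constant or linear in x, so the trapezoidal rule is exact here; other lifetime laws would need a finer grid.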


REPLACEMENT OF FAULTY ELEMENTS 


Further analysis of reliability in terms of failure counting is of some significance.

Let ξ be the random instant of failure of a component switched in at $t = t_0$. Under the assumption that the failure probability in the time interval $(t_0, t_0 + t)$ depends only upon its length and does not depend on the moment $t_0$ of switching, the distribution function of the random variable can be computed from

$$P\{t_0<\xi\le t_0+t\}=\int_0^t w(s)\,ds, \qquad (13)$$

where w(t) is the probability density of the random variable.




Let us call the portion of time during which the system is out of operation the down time. The down time is the sum of a number of random values: a random time required for searching for a faulty element ($\eta_1$), a random time required for repair ($\eta_2$), a random time required for maintenance ($\eta_3$), and so on. If $\eta_1, \eta_2, \eta_3, \ldots$ are independent, the characteristic function method is suitable for analyzing the down-time distribution.

We can assume that the failure search time initiated at $t_1$, the repair time initiated at $t_2$, and the maintenance time initiated at $t_3$ are independent of $t_1$, $t_2$, and $t_3$, respectively. Their distributions, however, depend upon the specified down time.

Let $\theta_{\eta_1}(v)$, $\theta_{\eta_2}(v)$, $\theta_{\eta_3}(v)$ be the characteristic functions of the random values $\eta_1$, $\eta_2$, $\eta_3$. Then the characteristic function $\theta_\eta(v)$ of their sum $\eta = \sum_i \eta_i$ (that is, of the down time) becomes

$$\theta_\eta(v)=\prod_i\theta_{\eta_i}(v). \qquad (14)$$

Let us also assume that the time of satisfactory operation of the element and the down time are independent random values. The sum of these random values, $\zeta = \xi + \eta$, determines a random failure separation. Let us introduce the characteristic function of the random value ξ:

$$\theta_\xi(v)=\int_0^\infty w(t)\,e^{ivt}\,dt. \qquad (15)$$

The characteristic function $\theta_1(v)$ of the random value ζ is determined by the product

$$\theta_1(v)=\theta_\xi(v)\,\theta_\eta(v). \qquad (16)$$


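The probability F(t) that the failure separation does not exceed t can now be calculated:

$$F(t)=\frac{1}{2\pi}\int_0^t\int_{-\infty}^{\infty}\theta_1(v)\,e^{-ivx}\,dv\,dx=\int_0^t W_1(x)\,dx, \qquad (17)$$

where

$$W_1(t)=\frac{1}{2\pi}\int_{-\infty}^{\infty}\theta_1(v)\,e^{-ivt}\,dv. \qquad (18)$$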

Let $\varphi_n(t)$ be the probability of n failures in the time interval (0, t). Then summing the probabilities gives

$$\varphi_n(t)=F_n(t)-F_{n+1}(t), \qquad (19)$$

where $F_n(t)$ is the distribution function of the sum of n failure separations (its characteristic function is $\theta_1^n(v)$), with $F_0(t) = 1$ for $t \ge 0$.

If the number of failures in the interval (0, t) is a random value ν, the probability that $\nu \le n$ can be computed from

$$P\{\nu\le n\}=\sum_{k=0}^{n}\varphi_k(t)=1-F_{n+1}(t). \qquad (20)$$

This equation determines the probability of satisfactory operation of the given element when n spare elements are provided for the time interval (0, t).
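Relations (19) and (20) can be illustrated with a small Monte Carlo experiment. The sketch below assumes, purely for illustration, that the operating time and the down time are both exponential with the same rate, so that a failure separation ζ has the Erlang(2) law and $F_{n+1}$ is the Erlang(2n + 2) distribution function; then $1 - F_{n+1}(t)$ equals the probability that a Poisson variable with mean rate·t does not exceed 2n + 1. Function names and parameter values are hypothetical.

```python
import math
import random


def poisson_cdf(n: int, mu: float) -> float:
    """P{X <= n} for a Poisson variable X with mean mu."""
    term = math.exp(-mu)
    total = term
    for r in range(1, n + 1):
        term *= mu / r
        total += term
    return total


def p_at_most_n_exact(n: int, t: float, rate: float) -> float:
    """(20) for Erlang(2, rate) separations: 1 - F_{n+1}(t), F_{n+1} = Erlang(2n + 2) cdf."""
    return poisson_cdf(2 * n + 1, rate * t)


def p_at_most_n_sim(n: int, t: float, rate: float, runs: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of P{nu <= n}: count separations ending within (0, t)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        clock, failures = 0.0, 0
        while True:
            clock += rng.expovariate(rate) + rng.expovariate(rate)  # one separation
            if clock > t:
                break
            failures += 1
        hits += failures <= n
    return hits / runs


print(p_at_most_n_exact(2, 3.0, 1.0), p_at_most_n_sim(2, 3.0, 1.0))
```

The event ν ≤ n is the same as "the (n + 1)-th separation ends after t", which is exactly the complement of $F_{n+1}(t)$ used in (20).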

Let us derive the expression for M(t), the mean of the random variable ν (i.e., the mean number of failures in time t), by introducing the Laplace transform of the function $W_1(t)$:

$$\theta_1(ip)=\int_0^\infty W_1(t)\,e^{-pt}\,dt, \qquad (21)$$

where p is a complex variable. Thus

$$M(t)=\frac{1}{2\pi i}\int_{a-i\infty}^{a+i\infty}\frac{\theta_1(ip)}{1-\theta_1(ip)}\,\frac{e^{pt}}{p}\,dp. \qquad (22)$$

We can also express $W_1(t)$ via M(t):

$$W_1(t)=\frac{1}{2\pi i}\int_{a-i\infty}^{a+i\infty}\frac{p\,\mu(p)}{1+p\,\mu(p)}\,e^{pt}\,dp, \qquad (23)$$

where $\mu(p)=\int_0^\infty M(t)\,e^{-pt}\,dt$.

Using (23), we can compute the probability density $W_1(t)$ of the time between failures for any given mean number of failures M(t) for any given (specified) time t. The average number of failures referred to the interval of time t, $\Lambda(t) = M(t)/t$, may be defined, in accordance with the established terminology in probability theory, as the failure rate.

It is obvious that the value $\Lambda(t)\,dt$ is the a priori failure probability in the time interval (t, t + dt). The asymptotic value of the failure rate for very large t can also be easily found.

Since M(t) = 0 when t < 0, by using the Tauberian theorem for the unilateral Laplace transform, the following equation can be obtained:

$$\lim_{t\to\infty}\Lambda(t)=\lim_{t\to\infty}\frac{M(t)}{t}=\lim_{p\to 0}p^2\mu(p)=\lim_{p\to 0}\frac{p\,\theta_1(ip)}{1-\theta_1(ip)}=\frac{1}{\tau^*}, \qquad (24)$$

where τ* is the mean time interval between two successive switchings on of an element.

If the average failure-free operating time of an element is t* and the average down time is $t_3^*$, then

$$\tau^*=t^*+t_3^*.$$


Let us consider the probability of failure of the k-th element (k = 2, ..., r) in the time interval (t, t + dt), provided that the failure of the basic element occurred at t = 0.

Using the rule of summing probabilities, one can obtain

$$m(t)\,dt=\sum_{k=1}^{\infty}P\{t<\zeta_1+\zeta_2+\cdots+\zeta_k<t+dt\}=dM(t). \qquad (25)$$

Thus the value m(t) dt is the a posteriori probability of failure in the time interval (t, t + dt).

It should be taken into consideration that the function m(t) is not the probability density of a random value and, consequently, does not have the elementary properties of distribution functions (for example, $\int_0^\infty m(t)\,dt$ diverges).

The function m(t) may be defined as the differential failure probability of an element at a given time moment, regardless of the failures that occurred before that moment. From (22) and (25) one can readily find

$$m(t)=\frac{1}{2\pi i}\int_{a-i\infty}^{a+i\infty}\frac{\theta_1(ip)}{1-\theta_1(ip)}\,e^{pt}\,dp. \qquad (26)$$


Using the above procedure, we can obtain the asymptotic value of the differential failure probability of an element a long time after the failure of the basic element:

$$\lim_{t\to\infty}m(t)=\lim_{p\to 0}\frac{p\,\theta_1(ip)}{1-\theta_1(ip)}=\lim_{t\to\infty}\Lambda(t).$$

Let us consider some typical cases illustrating the theory.


Instantaneous Switching of a Spare Element 


In the case of instantaneous switching of a spare element, the probability density of the random value η is a delta function, and the distribution of the time intervals between failures coincides with the distribution of the failure-free operating time of the element. For the exponential law of element reliability with average time t*, we obtain

$$\varphi_n(t)=\frac{1}{2\pi i}\int_{a-i\infty}^{a+i\infty}\frac{t^*}{(1+pt^*)^{n+1}}\,e^{pt}\,dp=\frac{1}{n!}\Big(\frac{t}{t^*}\Big)^n e^{-t/t^*}.$$

This expression is known as the Poisson distribution law, and its appearance in the given case might have been predicted.

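The probability of no more than n failure counts during time t is

$$P\{\nu\le n\}=\sum_{r=0}^{n}\frac{1}{r!}\Big(\frac{t}{t^*}\Big)^r e^{-t/t^*}=1-\frac{1}{\Gamma(n+1)}\int_0^{t/t^*}z^n e^{-z}\,dz=1-\frac{\Gamma(n+1,\,t/t^*)}{\Gamma(n+1)},$$

where $\Gamma(n+1, t/t^*)=\int_0^{t/t^*}z^n e^{-z}\,dz$ is an incomplete gamma function.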
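The equivalence of the Poisson sum and the incomplete-gamma form can be checked numerically. The Python sketch below evaluates $\Gamma(n+1, x)=\int_0^x z^n e^{-z}\,dz$ by the composite Simpson rule (an assumed numerical method, not one from the paper) and compares $1-\Gamma(n+1,x)/n!$ with the Poisson sum for x = t/t*; the function names are hypothetical.

```python
import math


def poisson_cdf(n: int, x: float) -> float:
    """Sum over r = 0..n of x**r * exp(-x) / r!, the left-hand side above."""
    term = math.exp(-x)
    total = term
    for r in range(1, n + 1):
        term *= x / r
        total += term
    return total


def lower_incomplete_gamma(s: int, x: float, steps: int = 2000) -> float:
    """Gamma(s, x) = integral of z**(s - 1) * exp(-z) over (0, x), Simpson rule."""
    h = x / steps

    def f(z: float) -> float:
        return z ** (s - 1) * math.exp(-z)

    total = f(0.0) + f(x)
    total += 4.0 * sum(f(i * h) for i in range(1, steps, 2))
    total += 2.0 * sum(f(i * h) for i in range(2, steps, 2))
    return total * h / 3.0


x = 2.0  # x = t / t*
for n in range(4):
    print(n, poisson_cdf(n, x), 1.0 - lower_incomplete_gamma(n + 1, x) / math.factorial(n))
```

Here Γ(n + 1, x) follows the paper's notation for the lower incomplete gamma function; many modern references write the same integral as γ(n + 1, x).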

The failure rate Λ(t) in this case coincides with the differential probability m(t) of switching of an element; it is constant for all t and equal to 1/t*.

The reverse relationship can also be readily proved: if the failure rate is constant for all t, the distribution law of the time intervals between failures is also exponential.

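Using a formula of operational calculus, it can be shown that the Laplace transform of the failure rate is

$$\int_0^\infty\Lambda(t)\,e^{-pt}\,dt=\int_p^\infty\mu(s)\,ds, \qquad (28)$$

and if $\Lambda(t) = 1/t^* = \text{const}$, then

$$\int_p^\infty\mu(s)\,ds=\frac{1}{t^*p}, \qquad (29)$$

and thus

$$\mu(p)=\frac{1}{t^*p^2},\qquad\text{or}\qquad W_1(t)=\frac{1}{t^*}\,e^{-t/t^*}.$$

It can be seen from these equations that the failure rate is constant only if the distribution law of the time intervals between failures is exponential.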


Constant Down Time 


Let us assume that the down time is constant and equal to $t_3^*$, and that the reliability distribution law is exponential. Under these assumptions the probability density of the failure duration is $\delta(t - t_3^*)$.

The probability of exactly n failure counts during the time interval (0, t) is

$$\varphi_n(t)=\frac{\Gamma\big(n,\ \frac{t-nt_3^*}{t^*}\big)}{\Gamma(n)}-\frac{\Gamma\big(n+1,\ \frac{t-(n+1)t_3^*}{t^*}\big)}{\Gamma(n+1)},\qquad t>(n+1)t_3^*. \qquad (30)$$


Fig. 1—Probability of not more than n failures in time t.


The probability that the number of failure counts within the time interval t does not exceed n is

$$P\{\nu\le n\}=1-\frac{\Gamma\big(n+1,\ \frac{t-(n+1)t_3^*}{t^*}\big)}{\Gamma(n+1)},\qquad t>(n+1)t_3^*. \qquad (31)$$


If n > 30, we may take advantage of the asymptotic equation

$$P\{\nu\le n\}\approx 1-F\Big(\frac{t-(n+1)(t^*+t_3^*)}{t^*\sqrt{n+1}}\Big), \qquad (32)$$

where F is the well-tabulated Laplace function. Fig. 1 shows the curves of the probability $P\{\nu \le n\}$ vs t/t* for various n and $t_3^*/t^* = 0.1$. The dotted line corresponds to the probability for the limit case $t_3^* = 0$, that is, for the Poisson distribution law.

Fig. 2—Failure rate and differential failure probability. Exponentially distributed down time.

The Down Time Exponential Distribution Law

In conclusion, it is of some interest to examine the case when the failure duration, like the operating time, is distributed according to the exponential law. In this case, if $t_3^* < t^*$,
he ee ee 
Ont) = TT@- iy! A or” Yt 
5 4 
(lee) y 
e dy. (33) 


The failure rate is

$$\Lambda(t)=\frac{1}{t^*+t_3^*}-\frac{t^*t_3^*}{(t^*+t_3^*)^2\,t}\Big[1-e^{-\frac{(t^*+t_3^*)t}{t^*t_3^*}}\Big], \qquad (34)$$

and the differential failure probability at time t after the failure of the basic element is

$$m(t)=\frac{1}{t^*+t_3^*}\Big[1-e^{-\frac{(t^*+t_3^*)t}{t^*t_3^*}}\Big]. \qquad (35)$$


The curves of the failure rate and the differential failure probability vs $t_3^*/t^*$ are shown in Fig. 2.
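Formulas (34) and (35) can be checked against a simulation of alternating exponential down and operating times. In the sketch below (hypothetical names, illustrative values t* = 1, $t_3^*$ = 0.2), M(t) = tΛ(t) from (34) is compared with the simulated mean number of failures, m(t) from (35) with a numerical derivative of M(t), and Λ(t) at large t with the asymptote 1/(t* + $t_3^*$) of (24). The simulation assumes, as in (25), that the basic element fails at t = 0, so each cycle starts with a down period.

```python
import math
import random


def renewal_mean(t: float, t_up: float, t_down: float) -> float:
    """M(t) = t * Lambda(t), with Lambda(t) from (34)."""
    a, b = t_up, t_down
    c = (a + b) / (a * b)
    return t / (a + b) - a * b / (a + b) ** 2 * (1.0 - math.exp(-c * t))


def failure_rate(t: float, t_up: float, t_down: float) -> float:
    """Formula (34)."""
    return renewal_mean(t, t_up, t_down) / t


def differential_probability(t: float, t_up: float, t_down: float) -> float:
    """Formula (35)."""
    a, b = t_up, t_down
    return (1.0 - math.exp(-(a + b) * t / (a * b))) / (a + b)


def simulate_mean_failures(t: float, t_up: float, t_down: float, runs: int = 40000, seed: int = 3) -> float:
    """Average number of failures in (0, t); each cycle is a down period, then operation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        clock = 0.0
        while True:
            clock += rng.expovariate(1.0 / t_down) + rng.expovariate(1.0 / t_up)
            if clock > t:
                break
            total += 1
    return total / runs


t, a, b = 3.0, 1.0, 0.2
print(renewal_mean(t, a, b), simulate_mean_failures(t, a, b))
print(differential_probability(t, a, b), failure_rate(1000.0, a, b), 1.0 / (a + b))
```

Because m(t) starts at zero and Λ(t) approaches 1/(t* + $t_3^*$) from below, both curves in Fig. 2 rise toward the same limit, as (24) predicts.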




PURCHASING RELIABILITY* 


E. J. BREIDING†


Summary—The contributing responsibility and 
a system for controlling component reliability as 
part of the procurement function are discussed. 
The system, which is particularly applicable to 
large-scale procurement, consists of three major 
controls: 


1) Component Vendor Approval Procedure—a 
tool for obtaining and documenting bona 
fide sources capable of supplying compo- 
nents to specified time and quality require- 
ments. 

2) Vendor Delivery Performance—a purchas- 
ing control for maintaining and upgrading 
performance from selected vendors. 

3) Vendor Quality Rating—a quality control 
tool to aid purchasing in maintaining and 
upgrading quality of products from selec- 
ted vendors. 


COMPONENT VENDOR APPROVAL 
PROCEDURE 


The first step for obtaining bona fide sources 
for components is to establish an approved com- 
ponent list. This is initiated at the earliest pos- 
sible engineering stage. Vendors are approved 
for a development list after it has been estab- 
lished by specification and performance analysis 
and, then, by initial survey that they can satisfy 
the requirements of development and/or product 
engineering, quality control, and purchasing. 

Actual writing of the component specification 
is the responsibility of the component application 
engineer who normally supplies purchasing with 
at least one producing source for the component. 
During the cycle of releasing a system, engineer- 
ing may aid by making purchasing aware of the 
technical state of the art so that they can determine the names and the number of sources required for a given component.

It is, therefore, particularly important that the 
purchasing department be qualified to establish 


*This paper was presented at the Thirtieth Radio Fall Meeting, Rochester, N. Y., October 27-29, 1958.

† International Business Machines Corp., Kingston, N. Y.


the proper criteria for multiple-sourcing of com- 
ponents on the list. 

The following criteria are suggested as the 
basis for a formal evaluation program to equip 
purchasing with data on which to advise priorities 
for sourcing evaluation. 


1) Cost of the component or its type per system. The estimated quantity to be used must be known so that components can be ranked in order of descending dollar value.

2) Degree of standardization or lack of stand- 
ardization resulting from: 

a) technical state of the art, 
b) proprietary rights. 

3) History of existing vendors’ delivery per- 
formance. 

4) History of existing vendors’ quality perfor- 
mance. 

5) Comparative failure analyses based on field 
failure data. Such data may be difficult to 
obtain due to the time required to make a 
significant survey; however, it is advisable 
to initiate this program where practical. 

6) Obviously, other factors, such as tooling 
cost and time, must be considered. 
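Criterion 1 amounts to ranking components by estimated program spend. The minimal Python sketch below assumes that dollar value is simply unit cost times quantity per system times the number of systems planned; the paper does not prescribe a calculation, and the component names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    unit_cost: float      # dollars per unit (hypothetical figures)
    qty_per_system: int   # units used in one system
    systems_planned: int  # estimated number of systems to be built

    @property
    def dollar_value(self) -> float:
        """Estimated total program spend on this component."""
        return self.unit_cost * self.qty_per_system * self.systems_planned


def sourcing_priority(components):
    """Order components by descending dollar value (criterion 1)."""
    return sorted(components, key=lambda c: c.dollar_value, reverse=True)


if __name__ == "__main__":
    parts = [
        Component("resistor", 0.05, 4000, 30),
        Component("pulse transformer", 6.50, 120, 30),
        Component("capacitor", 0.40, 900, 30),
    ]
    for c in sourcing_priority(parts):
        print(f"{c.name:20s} ${c.dollar_value:>12,.2f}")
```

In practice the other criteria (standardization, vendor history, field failure data, tooling) would then adjust this purely financial ordering.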


When the need for a component in production 
quantity is definitely established, suggested ven- 
dors are submitted for approval to product engi- 
neering, quality control, and the engineering buy- 
er. Approvals are secured as illustrated in the 
flow chart in Fig. 1. 

In relation to the SAGE program, manufactur- 
ing engineering reviews the suggested vendor’s 
products and shipping containers to determine if 
they are adaptable to the high-speed production 
essential to SAGE manufacturing operations. If 
the product is practical for this mechanization, 
or can be brought to standards by the vendor, 
approval is granted by manufacturing engineering. 

Accurate assessment by purchasing of the 
vendor’s production facilities is a critical factor 
in determining the product quality and delivery 
performance which can be safely anticipated from 
a given source. The facility survey is a function of quality control, and it should cover the vendor's quality, inspection, calibration, and parts- and record-control techniques.


20 IRE TRANSACTIONS ON RELIABILITY AND CONTROL APRIL 


Fig. 1 —Flow chart component approval. (Stages: request to process a component for approval; purchasing screens the supplier and initiates and distributes component approval request forms to product engineering, manufacturing engineering, purchasing, and quality control; acquisition of engineering samples, plant visitation, and acquisition of production samples, each if required; receiving inspection of the production samples. At each stage a disapproval notifies all departments and stops processing, and an approval proceeds once all copies of the component approval request have been received. Approved vendors are added to, and deleted vendors removed from, the approved component listing and the vendor facility file.)


VENDOR DELIVERY RATING 


Approved vendors are listed on individual vendor record cards (see Fig. 2). Both the vendor delivery and the quality ratings are graphically presented on the card and are compatible in that both use the 0 to 100 scale with identical cut-off points for excellent, average, and undesirable. A ready reference is now available which the buyer uses to review quality and delivery performance of a particular vendor.

Fig. 2 —Vendor record card. (Example card: ABC Company, 123 4th St., New York; vendor code 304101; major products: capacitors, pulse transformers, resistors; 240 personnel; entries for financial status, quarterly dollar volume of KMPD business for 1958 to 1962, and per cent of vendor capacity (20); delivery chart graded A, excellent; B, average; C, undesirable.)

The delivery rating system utilized for these 
evaluations was developed and initiated at IBM 
Military Products Division, Kingston, N. Y., to 
deal with the great number of vendors involved in 
purchasing for the SAGE computer system. 

The rating is determined by adding the total 
number of days late on each delivery and dividing 
by the total number of shipments. This gives the 
average number of days late. The resultant rat- 
ing is then classified according to a scale rang- 
ing from 0 to 100, with 88 to 100 defined as ex- 
cellent, 60 to 88 as average, and below 60 as 
undesirable. The method provides a formula 
which impartially evaluates each vendor, whether 
large or small and frequently or infrequently 
used, and graphically displays the rating so that 
trends and patterns can be immediately dis- 
cerned. 
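The averaging step is simple enough to sketch. The text does not say how the average number of days late is converted into the 0 to 100 rating, so the minimal Python sketch below keeps the two steps apart: one function averages days late over shipments (assuming early or on-time deliveries count as zero days late), and another maps an already-computed rating onto the excellent/average/undesirable bands (assuming a rating of exactly 88 counts as excellent). Both function names are hypothetical.

```python
def average_days_late(days_late_per_shipment):
    """Total days late divided by the number of shipments."""
    shipments = list(days_late_per_shipment)
    if not shipments:
        raise ValueError("no shipments recorded")
    # Assumption: early or on-time deliveries count as 0 days late.
    return sum(max(0, d) for d in shipments) / len(shipments)


def classify(rating):
    """Map a 0-100 rating onto the A/B/C bands of the vendor record card."""
    if not 0 <= rating <= 100:
        raise ValueError("rating must lie between 0 and 100")
    if rating >= 88:   # assumption: exactly 88 counts as excellent
        return "A"     # excellent
    if rating >= 60:
        return "B"     # average
    return "C"         # undesirable
```

With the averages quoted in the next paragraph, classify(91.7) returns "A" and classify(75) returns "B".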

Significantly, with use of the control, the over- 
all average monthly delivery rating jumped to 
91.7 from an initial statistical average of 75. 


VENDOR QUALITY RATING 


As Fig. 2 indicates, quality performance is also charted on the vendor record card.


BREIDING: PURCHASING RELIABILITY 


21 


The quality rating is based on a statistical average rating of 75, which occurs when the sample fraction defectives are equal to the specified AQL's, with a maximum of 1 in 25 lots rejected. A rating of 88 occurs when the sample fraction defectives equal 0.5 times the specified AQL's, with a maximum of 1 in 156 lots rejected.
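The paper gives anchor points rather than a formula. One rule consistent with them, offered here purely as an assumption and not as the procedure in the IBM brochure, is a rating that falls linearly with the ratio of sample fraction defective p to the AQL: R = 100 - 25 p/AQL. It gives 100 only when no defectives are found, 75 at p = AQL, and, read against the 88 and 60 cut-offs, 87.5 (about 88) at 0.5 AQL and 60 at 1.6 AQL, the rating limits listed in Table I. It also fits Table I's remark that the calculation needs only sample results and a constant for each AQL (here 25/AQL). The lot-rejection figures are not modeled, and the function name is hypothetical.

```python
def quality_rating(defectives, sample_size, aql):
    """Hypothetical linear quality rating: 100 - 25 * (p / AQL), floored at 0.

    p is the sample fraction defective and aql the specified AQL, both
    expressed as fractions (e.g. 0.01 for a 1 per cent AQL).
    """
    if sample_size <= 0 or aql <= 0:
        raise ValueError("sample_size and aql must be positive")
    p = defectives / sample_size
    # The per-AQL constant is 25 / aql; very poor lots bottom out at 0.
    return max(0.0, 100.0 - 25.0 * p / aql)
```

For example, 2 defectives in a sample of 200 against a 1 per cent AQL gives p = AQL and a rating of about 75.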

Complete information on the procedure is 
available in a brochure published by the Kingston 
Military Products Division of IBM, Kingston, 
N.Y. 

As Table I indicates, the rating method itself eliminates rating inequalities common to some other rating systems. The visual method of displaying current status makes the chart understandable and practical for both vendor and buyer
reference. The buyer has the added advantage of 
being equipped with a quick and accurate visual 
history which immediately pinpoints any vendor 
trends which could affect supply or quality. The 
system has been markedly beneficial in estab- 
lishing a smoother flow of components from the 
vendor and in preventing production crises that 
interrupt the procurement of products of re- 
quired reliability. 


VENDOR SUPPORT 


It is advisable that vendors be fully acquainted 
with the philosophy of the rating systems to re- 
alize maximum effectiveness from the program. 
The IBM Kingston Procurement and Quality Con- 
trol departments jointly sponsored a program to 
acquaint 150 of their suppliers with the purpose 
of the controls and methods used for measuring 
their performance. Engineering-type seminars 
and demonstrations are held on a regular basis 
to assist the vendor in satisfying quality and per- 
formance requirements and to improve vendor 
products and relations. 


OTHER RELIABILITY DATA 


Usually, the responsibility for component reliability is associated with engineering, as it concerns circuit design, component evaluation, and maintenance techniques. Numerous reports covering these engineering aspects of reliability have been made, including a trilogy entitled ‘‘Reliability of an Air Defense Computing System.’’² The Delivery Performance System described in this paper has also been reviewed in the literature.³

¹G. Harding and J. Rowinski, ‘‘Vendor Quality Rating Procedure,’’ IBM Corp., Kingston, N. Y., Brochure; March 15, 1957.


TABLE I
COMPARISON CHART

Factor: Rating of 100
  Other rating systems: Rating of 100 can be obtained if defectives are present in sample and sometimes may not be obtained even if no defectives are found.
  KMPD rating system: Rating of 100 is obtained only if no defectives are found in sample.

Factor: Effect of quantities submitted upon rating
  Other rating systems: Vendors submitting small lots may deserve, and not obtain, high ratings because ratings are affected by lot size.
  KMPD rating system: Vendors submitting lots with few defectives are given high ratings regardless of quantity.

Factor: Standard deviations
  Other rating systems: Standard deviations are accepted as criteria and involve lot size.
  KMPD rating system: AQL values, to which the vendor adjusts the quality of his material, are used in place of standard deviations.

Factor: Rating limits
  Other rating systems: Rating limits are relative to standard deviations, instead of being specific values known to the vendor.
  KMPD rating system: Rating limits are based on 0.5 and 1.6 AQL's and are easily understandable to the vendor.

Factor: Mathematics required
  Other rating systems: Calculations involve a combination of formulas and complex procedures.
  KMPD rating system: Calculations involve only sample results and a corresponding constant for each AQL, as needed.

Factor: Interpretation of ratings
  Other rating systems: Ratings are influenced by factors which the vendor cannot control. They cannot be used by the vendor as an accurate guide for improving his product.
  KMPD rating system: Ratings reflect only the actual quality of the product. The vendor can rely on the accuracy of the ratings.


CONCLUSIONS

Procurement departments must, in the future, make full use of measurement and control techniques to assure that their selection of vendors assists the over-all reliability program. The control system described in this paper has demonstrated several distinct advantages in operation that recommend its use.

²H. F. Heath, Jr., ‘‘Reliability of an air defense computing system: component development,’’ IRE TRANS. ON ELECTRONIC COMPUTERS, vol. EC-5, pp. 224-226; December, 1956.
R. E. Nienburg, ‘‘Reliability of an air defense computing system: circuit design,’’ IRE TRANS. ON ELECTRONIC COMPUTERS, vol. EC-5, pp. 227-233; December, 1956.
M. M. Astrahan and L. R. Walters, ‘‘Reliability of an air defense computing system: marginal checking and maintenance programming,’’ IRE TRANS. ON ELECTRONIC COMPUTERS, vol. EC-5, pp. 233-237; December, 1956.

³‘‘Rating system pinpoints delivery and quality trends,’’ Purchasing News, July 28, 1958.




DIAGNOSIS OF EQUIPMENT FAILURES* 


J. D. BRULE,† R. A. JOHNSON, and E. J. KLETSKY†


Summary—This paper introduces several new 
concepts which are applicable to the problem of 
diagnosis of equipment failures. Following the 
definitions of an equipment, an element of the 
equipment, and the model of a test, a general dia- 
gram of a testing procedure is developed. The 
testing diagram is constructed in such a way that 
the various tests needed and the probability of 
failure of the elements are readily incorporated. 
While it is found that a completely general testing 
diagram becomes quite complicated even when the 
equipment under consideration is not intricate, a 
major simplification is obtained by introducing a 
simplified diagram with suitably restricted tests. 
This simplified testing diagram may be used re- 
peatedly in order to find all the faulty elements of 
the equipment. 

With reference to the testing diagram, it is 
possible to compute the minimum average cost of 
diagnosing the equipment. This appears to be the 
most useful measure of the efficiency of a test 
procedure. The order of magnitude of this opti- 
mization problem is discussed and solutions for 
two special cases are obtained by analogy with an 
optimum coding problem. 


INTRODUCTION 


Recent advances in the design of electronic 
systems have resulted in an ever-increasing com- 
plexity in such systems. While such complexity 
is necessary to permit the systems to perform 
tasks of ever-increasing scope, the problem of 
keeping the systems in working order tends to in- 
crease at least as fast as the basic complexity. 
The present investigation is concerned with one 
particular phase of the problem of maintenance— 
namely, the problem of determining which part of 
an equipment is in need of repair when the equip- 
ment as a whole does not function properly. While 
this is only one aspect of the whole problem of 
maintaining equipment in working order, any im- 
provement in general diagnostic procedures 
would result in significant economies. 

Some of the other factors which contribute to maintainability are the reliability of the basic component parts, proper preventive maintenance procedures, and proper mechanical and electrical design so as to make component parts easily accessible for replacement. Despite the advances in component reliability and preventive maintenance, it appears to be the ‘‘nature of the beast’’ that a great many unpredictable or random failures do occur. It is for such failures that efficient diagnostic procedures are important. Equipment logs indicate that a significant part of the ‘‘down time’’ of equipments is spent in diagnosing what part of the equipment does not function properly. Such records also show that incorrect diagnoses occur frequently. For example, tubes which have been replaced as defective are often found to be perfectly acceptable.

*This work was partially supported by Rome Air Dev. Center under contract AF 30(602)-1833.

† Electrical Engrg. Dept., Syracuse University, Syracuse, N. Y.

The present work was undertaken in the hope 
that examination of the basic fundamentals of 
diagnostic procedures would permit substantial 
economies to be made in the diagnostic phase of 
maintenance. The need for systematized proce- 
dures is increased by the fact that in many cases 
equipment must be maintained by semiskilled 
technicians whose training does not enable them 
to understand completely the functioning of the 
complicated machines which they are to maintain. 
We do not propose that any results we obtain 
would necessarily aid a highly skilled technician 
with adequate experience on the equipment in 
question. However, when the training or experi- 
ence is inadequate, systematized procedures 
should result in economies. Another factor which 
favors systematized diagnostic procedures is the 
present trend towards automatic equipments and, 
in particular, equipments which are capable of 
diagnosing failures of their own component parts. 
It is axiomatic that any function which is to be 
performed by a machine must first be systema- 
tized. 

A review of the technical literature yields a 
surprisingly small number of references to the 
general problem of efficient or optimum diagnos- 
tic procedures. A recent paper by Hoehn and 
Saltz [1] which summarizes previous work gives 
only two references [2], [3]. Hoehn and Saltz 
discuss two approaches to efficient diagnosis. 
Diagnostic procedures which are applicable to 
specific equipments have been published (usually 
in the form of maintenance manuals, many of 




which are classified) but the basic aspects of the 
diagnostic problem are then hidden by the con- 
straints of the specific equipment. A recent 
paper [4] on the identification of the various 
types of malfunctions of a synchro-repeater system is an example of this.

A problem related to the diagnosis of equip- 
ment failures is the initial check-out or ‘‘debug- 
ging’’ of equipments. Although this problem is 
not considered in detail here, many of the tech- 
niques developed for diagnosis are also applicable 
to check-out. 


MATHEMATICAL MODELS 


In order to apply mathematical techniques to 
the problems encountered in the testing of equip- 
ment, it is necessary to formulate general mathe- 
matical models of an equipment and a test proce- 
dure. If such models are to be applicable to a 
large class of equipments, they must necessarily 
be somewhat abstract. Accordingly, one should 
not expect that the models contain all the details 
peculiar to a given equipment or a given test pro- 
cedure. However, the models must be capable of 
representing the essential features and organiza- 
tion of virtually any equipment and the various 
procedures by which it can be tested. 


Model of an Equipment 


As a start toward developing a model of an 
equipment, the following definitions are made: 


Equipment: For test purposes, an equipment 
is defined as a collection of functional ele- 
ments which are interconnected so as to 
generate specified responses on the appli- 
cation of a specified set of primary stimu- 
lants. The primary stimulants are inde- 
pendent of the operation of the equipment. 

Functional Element: A functional element is defined as a constituent part of an equipment which generates a single response on the application of a specified combination of stimulants which may include the responses generated by other functional elements.¹

These two definitions implicitly define a convenient diagram of an equipment which is to be the subject of a testing procedure. An example of such a diagram is shown in Fig. 1. In this diagram the functional elements are denoted by lower case letters and the responses of the elements by the corresponding upper case letters. $S_1$ and $S_2$ represent the primary stimulants to the equipment as well as the stimulants to elements a and c, respectively. E and D represent the responses of the equipment as well as the responses of elements e and d, respectively. For a complete specification of the equipment, it is necessary to specify which combinations of stimulants are required for each element to generate the specified output. In the simplest case, the specified response is generated only when all of the stimulants to that element are present and the element in question is functioning properly. Inspection of Fig. 1 shows that with this assumption the specified equipment responses (E and D) will not be present unless $S_1$ and $S_2$ are present and all of the elements are functioning properly.

Fig. 1 —Diagram of an equipment.

¹In common engineering terminology the responses are the ‘‘outputs’’ and the stimulants are the ‘‘inputs.’’ The terminology used is intended to be as general as possible.

The definitions of equipment and, particularly, 
functional element given above are not sufficient- 
ly detailed to determine completely the descrip- 
tion of an equipment for test purposes. This 
freedom in the designation of the elements is 
necessary if the model is to be useful in repre- 
senting actual diagnosis procedures. For ex- 
ample, in a military situation, a large system 
may include radars, computers, and guns. The 
whole system may be described to be the ‘‘equip- 
ment’’ and the radars, computers, and guns to be 
the ‘‘functional elements.’’ This description 
would be appropriate for the personnel respon- 
sible for the actual control of the system. For 
such personnel the diagnosis of a malfunction of 
the system is merely the determination of which 
radar, computer, or gun is not functioning prop- 
erly. If it is found that the radar is at fault, the 
responsibility for further diagnosis is shifted, 
for example, to a radar technician. To the tech- 
nician the radar is the equipment rather than the 
functional element and the transmitter, receiver, 
sweep generators, etc., are the functional ele- 
ments. Having located the fault, say, to the re- 
ceiver, the receiver becomes the equipment and 
the IF amplifier, local oscillator, video 




amplifier, etc., become functional elements. Thus 
the diagnosis proceeds through a number of lev- 
els. The model of an equipment that is intro- 
duced is applicable to each level by proper des- 
ignation of ‘‘equipment,’’ and ‘‘functional ele- 
ment,’’ and the diagram of an equipment at each 
level follows when the functional elements have 
been identified and the stimulants and responses 
specified. 

Two possible sources of complication which 
are inherent in the definitions of equipment and 
functional element have not been included in the 
simple example of Fig. 1. In the first place, 
there are no loops in the diagram such that any 
response depends on itself through feedback. An 
equipment with feedback may be tested by break- 
ing the loop and treating the response at the 
break as an additional response and primary 
stimulant. 

The second complication which may arise in 
the diagram of an equipment is the presence of 
redundancy. This problem is discussed briefly 
by Johnson, et al. [7], where alternative means 
for testing redundant systems are presented. 
One can determine if all paths are functioning by 
breaking all the redundant paths and replacing 
them in turn or by designating the output of each 
redundant path as an equipment response. Alternatively, one might add an element to the equipment which would have an output only when all
redundant paths are functioning. 


Model of a Test 


The diagram of an equipment introduced above 
implicitly defines the tests which can be per- 
formed in diagnosing the equipment for the pur- 
pose of determining which element, if any, is not 
functioning properly. Such tests consist of sup- 
plying stimulants to the elements and observing 
the responses. For example, (with reference to 
Fig. 1) if stimulants A and C are applied to 
element b and the response at B meets the 
specifications, element b is known to be functioning properly. Similarly, if $S_1$ and $S_2$ are applied and response B satisfies the specifications, we can infer that elements a, b, and c are functioning properly.

In order to avoid repetition of the phrase ‘‘functioning properly,’’ the following terms shall be used in the report:


‘‘Good’’ element: an element which functions
properly. When the specified stimulants 
are supplied to the element, the specified 
response from the element is developed. 

‘‘Bad’’ element: an element which does not 
function properly. When the specified 


BRULE, JOHNSON, KLETSKY: DIAGNOSIS OF EQUIPMENT FAILURES 


25 


stimulants to the element are supplied, the 
specified response from the element is not 
developed. 

‘‘Questionable’’ element: an element which is
not known to be good. 


In these terms, a diagnosis test procedure 
consists of identifying the bad elements, if any, in 
a set of questionable elements. Similarly a 
check-out test procedure consists of determining 
whether all elements are good which, in turn, 
assures that the equipment is functioning proper- 
ly. In most test procedures the determination of 
which questionable elements are, in fact, good or 
bad is made by logical deduction from the results 
of a number of tests. 

Any given test will pass if the elements being 
tested are all good, and will fail if one or more 
of these elements are bad. For an equipment 
with N elements, a possible representation (and 
also designation) of a test is a sequence of N 
symbols, one symbol for each element. A given 
test examines a subset of k of these elements to 
determine if they are good and ignores the re- 
maining N-k elements. The symbol 0 is as- 
signed in the jth position of a test if the jth 
element of an equipment must be good in order 
for the test to pass. The symbol 1 will be
placed in the jth position of a test if the jth ele- 
ment is not tested. 

This notation makes it possible to construct a 
complete list of all the tests which can be per- 
formed on an equipment for which an equipment 
diagram is available. Table I contains such a 
list for the example shown in Fig. 1. All of the 
tests are performed by applying a set of stimu- 
lants as listed in the second column and observ- 
ing the response indicated in the first column. 
The third column lists the elements which must 
be good if the specified response is to be ob- 
served. The numerical designation of the test is 
given in the last column. The digits of the test 
designation are 0 if the corresponding element 
must be good in order for the response to be 
within specifications (i.e., for the test to pass). 
The total number of different tests which can be 
defined for an equipment with N elements is 
obviously equal to the number of N-digit binary 
numbers. However, the test which has a numerical designation 11...1 is of no significance
since the result is independent of whether the 
elements are good or bad. Accordingly, the total 
number of significant tests is given by 


$$2^N - 1.$$


For the example (Fig. 1) only the 19 tests listed 
in Table I are realizable in accordance with the 




TABLE I
Tests Associated with the Equipment of Fig. 1

Good Elements    Numerical Designation
to Pass          abcde
a                01111
b                10111
c                11011
d                11101
e                11110
ab               00111
bc               10011
abc              00011
cd               11001
be               10110
abe              00110
bce              10010
abce             00010
de               11100
cde              11000
bde              10100
abde             00100
bcde             10000
abcde            00000


definition of a test. The remainder of the 31 
tests can only be realized as combinations of 
these 19 tests. For example, the result that tests 11001 and 10011 both passed is equivalent to test 10001 passing, and the failure of either one is equivalent to test 10001 failing. In either case the conclusion is that elements b, c, and d are all good (or that at least one of them is bad).
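The count of realizable tests can be checked mechanically. Fig. 1 is not reproduced here, so the Python sketch below ASSUMES a wiring read back from Table I and from the text: a is driven by $S_1$, c by $S_2$, b by A and C, d by C, and e by B and D. Observing an element's response then requires that element to be good, and each of its inputs is either injected from outside or generated by an upstream element that must also be good. Under that assumption the sketch reproduces exactly the 19 designations of Table I, and the digit-by-digit product combines tests as in the 11001/10011 example. All names are hypothetical.

```python
from itertools import product

ELEMENTS = "abcde"
# ASSUMED wiring of Fig. 1, inferred from Table I. Each element lists the
# elements whose responses it uses; S1 (to a) and S2 (to c) are always
# applied externally, so they add no elements to a test.
FEEDS = {"a": [], "b": ["a", "c"], "c": [], "d": ["c"], "e": ["b", "d"]}


def good_sets(element):
    """Every set of elements that must be good when `element`'s response is
    observed: each input is injected (adds nothing) or produced upstream."""
    per_input = [[frozenset(), *good_sets(src)] for src in FEEDS[element]]
    return {frozenset({element}).union(*choice) for choice in product(*per_input)}


def designation(good):
    """0 where the element must be good, 1 where it is not tested."""
    return "".join("0" if e in good else "1" for e in ELEMENTS)


def realizable_tests():
    """Sorted designations of all tests realizable by a single observation."""
    return sorted({designation(s) for e in ELEMENTS for s in good_sets(e)})


def combine(*designations):
    """Designation equivalent to all the given tests passing: the digit-by-digit
    product (a 0 in any test gives a 0)."""
    return "".join(str(min(int(d) for d in digits)) for digits in zip(*designations))


if __name__ == "__main__":
    tests = realizable_tests()
    print(len(tests), "realizable of", 2 ** len(ELEMENTS) - 1, "significant")
    print(combine("11001", "10011"))
```

Run as a script, it reports 19 realizable tests out of 31 significant ones and prints 10001 for the combination, which is itself not among the 19.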

A somewhat different type of test, not neces- 
sarily included in the above tabulation, is a test 
that is made by replacement. In a replacement 
test, a questionable element is replaced by an 
element that is known to be good, various stimu- 
lants are applied and a response is observed. 
Two such tests are listed below for the example 
of Fig. 1. 


Replace   Observed   Stimulants   Good Elements   Numerical
Element   Response   Required     to Pass         Designation
c         E          S1, S2       abde            00100
b         E          S1, S2       acde            01000


The test 00100 is equivalent to one on the list, 
while the test 01000 is different. However, this 




example shows that the proposed designation of a 
test is capable of representing replacement tests. 
As a final comment, note that the first five 
tests listed in Table I are single element tests in 
that they have but one 0 in their designation. As 
such, failure to pass one of these tests provides 
unambiguous information that the corresponding 

element is bad. 


Testing Procedures 


Testing procedures may be organized in two 
essentially different ways. In the first, a number 
of tests are performed and the results analyzed 
to determine which elements are good and which 
are bad. Note that in this case the order of per- 
forming the tests is immaterial since all the 
tests are done before analysis is attempted. Such 
a procedure will be termed ‘‘combinational’’ in 
that the analysis depends on the combination of the 
results of the tests. If it is assumed a priori that
only one element or a small fraction of the ele- 
ments are bad, such a procedure is inefficient in 
that it requires that all the tests be done in every 
testing procedure. In an alternative procedure, 
the choice of the next test to be done is dependent 
on the results of the previous tests. Such a pro- 
cedure will be called ‘‘sequential’’ in that the 
analysis of the test results is carried out sequen- 
tially. Most diagnostic procedures are sequential 
in nature and essentially consist of localizing a 
bad element. Check-out (check-up in medical 
terminology) testing procedures, however, are 
not sequential in that a number of tests are re- 
quired and the order of performing the tests is 
not important. The representation of a test pro- 
cedure developed below is directly applicable to 
the sequential procedure since our primary 
interest is in diagnosis. 

A convenient method of stating the cumulative information² gain from the results of several tests is to define ‘‘information states’’ in which the status of each element is given as good, bad, or questionable. This concept enables one to draw a diagram of a sequential testing procedure in which new information states are generated from previous states and the results of intervening tests. The basic building block in such a diagram is shown in Fig. 2. In general, a complete testing procedure results in a single final state in which the status of each of the N elements is known to be either good or bad. Since such a state can be specified by an N-digit binary number, it is evident that there are $2^N$ final states possible. Each building block splits one state into two, so a diagram ending in $2^N$ final states contains $2^N - 1$ building blocks (Fig. 2); consequently, the number of tests which appear in a complete diagram is $2^N - 1$. Thus, even for a small number of elements, the diagram of a complete sequential test procedure becomes quite complicated.

Fig. 2 —Basic building block of a sequential test diagram. (A test vertex whose upper branch leads to the new state if the test is passed and whose lower branch leads to the new state if the test is failed.)

²Here, ‘‘information’’ is used in the qualitative sense. A quantitative development of this concept is presented in a companion paper by R. A. Johnson, ‘‘An information theory approach to diagnosis,’’ to be presented at the Sixth National Symposium on Reliability and Quality Control, Washington, D. C., January 11-13, 1960. Abstract, this issue, p. 35.


A Simplified Sequential Test Procedure 


A considerable simplification of the diagram of a diagnostic testing procedure is obtained if the assumption is made ‘‘that one and only one element is bad.’’ Such a diagram will be referred to as a simplified sequential test diagram. In this diagram the number of final states is equal to the number of elements N since any one element may be bad. Also, the diagram has on it exactly N - 1 test vertices. This is seen by noting that at any state in the testing diagram where there are N questionable elements, a single test will divide the N elements into two subgroups of $N_1$ and $N_2$ elements, where $N = N_1 + N_2$. For a test to be nontrivial, both $N_1$ and $N_2$ must be equal to or greater than one. This property of a test can be shown quite simply by representing the N elements as a sequence of N symbols on a line, as shown in Fig. 3. In this figure, test $T_1$ divides the $N = 5$ elements into two subgroups of $N_1 = 2$ and $N_2 = 3$ elements. These two subgroups must each be subdivided until, as stated above, the original group of N elements is divided into N subgroups. This is accomplished when a line representing a test is placed between each pair of adjacent symbols representing elements. There are N - 1 intervals between the N symbols; thus N - 1 tests must appear on the testing diagram.

Fig. 3 —Symbolic representation of a test.

Note that this simplified sequential test diagram can be used over and over to find each of the bad elements provided that the presence of more than one bad element does not make it impossible to find at least one.
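The vertex count can also be written as a short induction. Here $V(N)$ is only our label for the number of test vertices needed to resolve $N$ questionable elements under the one-bad-element assumption; it is not the authors' notation.

$$
\begin{aligned}
V(1) &= 0, \qquad V(N) = 1 + V(N_1) + V(N_2), \quad N_1 + N_2 = N,\; N_1, N_2 \ge 1,\\
V(N) &= 1 + (N_1 - 1) + (N_2 - 1) = N - 1 \quad \text{(assuming } V(k) = k - 1 \text{ for all } k < N\text{)}.
\end{aligned}
$$

The same argument, applied to a diagram whose $2^N$ final states are reached through two-way building blocks, gives the $2^N - 1$ tests of the complete, unsimplified procedure.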

Referring back to the equipment defined by Fig. 1 and the notation previously introduced, the simplified diagram of a sequential diagnostic test procedure is shown in Fig. 4. Here the initial state has been designated 11111 since at this state each of the elements is questionable. If the first test is passed, elements c and d are known to be good and therefore this state is designated 11001. If the test fails, c or d is bad and, by the assumption that only one element is bad, a, b, and e must be assumed good. Therefore, the state designation becomes 00110.

Fig. 4 —A simplified sequential test diagram.

A test following state 11001 may be any test which separates elements a, b, and e. The tests $T_{11100}$, $T_{11000}$, $T_{11010}$, and $T_{11110}$ are all equivalent here, since c and d are known to be good before the test. From this it is seen that the structure of the testing diagram and the assignment of the final states do not uniquely determine the tests which must be used.
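Properties 6 and 7 listed below give the rule behind these designations: the new state is the digit-by-digit product of the previous state with the test designation (pass) or with its complement (fail). A minimal sketch, assuming designations are strings of '0' and '1' with element a in the first position; `apply_test` is a hypothetical name. It reproduces the Fig. 4 states, taking the first test to be 11001 (an inference: that is the only designation whose product with 11111 gives the pass state 11001), and confirms that the four tests named above give identical outcomes from state 11001.

```python
def complement(designation: str) -> str:
    """Interchange 0's and 1's (the complement T' of a test T)."""
    return "".join("1" if d == "0" else "0" for d in designation)


def apply_test(state: str, test: str, passed: bool) -> str:
    """New state after a test: digit-by-digit product of the previous state
    with the test designation (pass) or with its complement (fail)."""
    if len(state) != len(test):
        raise ValueError("state and test must have the same number of digits")
    factor = test if passed else complement(test)
    return "".join("1" if s == "1" and f == "1" else "0" for s, f in zip(state, factor))


if __name__ == "__main__":
    start = "11111"                           # every element questionable
    print(apply_test(start, "11001", True))   # 11001: c and d known good
    print(apply_test(start, "11001", False))  # 00110: c or d is bad
    for t in ("11100", "11000", "11010", "11110"):
        # Same pass and fail states from 11001, so these tests are equivalent here.
        print(t, apply_test("11001", t, True), apply_test("11001", t, False))
```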

Since much of the remainder of the paper is concerned with analysis of simplified sequential testing diagrams of the type illustrated in Fig. 4, a list of the important properties of such diagrams is of interest. Some of these properties are:


1) There is one initial state, designated 11...1 (a 1 in each of the N positions).

2) There are N final states where N is 
the number of elements in the equip- 
ment. The final states are character- 
ized by a designation which includes 
one ‘1’ and N - 1 ‘0’s. 

3) There are a total of N- 1 “test 


28 IRE TRANSACTIONS ON RELIABILITY AND CONTROL 


vertices’’ required independent of the 
structure of the diagram. 

4) The same test (as determined by the 
test designation) may appear on more 
than one path from the initial to a final 
state. (The case of purely combination- 
al testing has the same tests on all paths 
from initial to final states. Thus the 
model includes combinational testing.) 

5) The same test does not appear more 
than once on the path from the initial 
state to a given final state. (This con- 
dition insures that the same test is 
never repeated in diagnosing a given 
failure.) 

6) The designation of the new state created 
when a test passes is obtained by multi- 
plying the designation of the previous 
state and the test designation, digit by 
digit. By convention this state appears 
on the upper branch leaving the test 
vertex. 

7) The designation of the new state created 
when a test fails is obtained by multi- 
plying the designation of the previous 
state and the complement of the test 
designation, digit by digit. By conven- 
tion this state appears on the lower 
branch leaving the test vertex. 


An examination of the maximum number of 
different tests which can exist for N elements 
for the simplified sequential test procedure will 
now be made. As noted above, there are a total of $2^N$ N-digit binary numbers, but the test consisting of all 1’s is trivial. In addition, the assumption that one element is bad eliminates the test consisting of all 0’s. Also, this assumption means that only half of these $2^N - 2$ tests are useful. This is because a test T and its complement T’ (0’s and 1’s interchanged) supply the same information about the state of the equipment. Consequently, the number of useful (essentially different) tests is $2^{N-1} - 1$.
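The count can be checked by brute force for small N. The sketch below (the helper `useful_tests` is hypothetical) enumerates every N-digit designation, drops the all-1 and all-0 designations, and keeps one representative of each complementary pair T, T’.

```python
from itertools import product


def useful_tests(n: int) -> list[str]:
    """Essentially different tests for n elements when exactly one is bad:
    exclude all 1's (trivial) and all 0's (always fails), and treat a test
    and its complement as the same test."""
    representatives = []
    seen = set()
    for bits in product("01", repeat=n):
        t = "".join(bits)
        if t in ("0" * n, "1" * n) or t in seen:
            continue
        comp = "".join("1" if d == "0" else "0" for d in t)
        seen.update((t, comp))          # the complement carries the same information
        representatives.append(t)
    return representatives


if __name__ == "__main__":
    for n in range(2, 8):
        print(n, len(useful_tests(n)), 2 ** (n - 1) - 1)
```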

It must be recognized that, in general, not all of these $2^{N-1} - 1$ tests may be possible. For
example, if the equipment consists of a cascade 
of 4 elements, a single test cannot be performed 
which will determine if the first and third ele- 
ments alone are good. Consequently, for a given 
set of m tests, it is necessary to establish if 
these tests are adequate to determine the bad 
element out of the N given elements. 

A method for determining the adequacy of a 
given set of tests is illustrated in the following 
example. Consider the set of m = 6 tests for 
N = 5 elements: 


$T_1 = 10010$,
$T_2 = 01110$,
$T_3 = 00101$,
$T_4 = 10110$,
$T_5 = 10101$,
$T_6 = 11000$.


The problem is to check these tests to establish if 
element 2, for example, can be isolated by any 
sequence of tests. Rewrite each test, or its com- 
plement, in such a way that there is a 0 in the 
second position and then add the 1’s in each col- 
umn as shown below. 


$T_1 = 10010$,
$T_2’ = 10001$,
$T_3 = 00101$,
$T_4 = 10110$,
$T_5 = 10101$,
$T_6’ = 00111$,
Column sums: 4 0 4 3 4.


The set is adequate to test element 2 if the 2nd 
column is the only one having the sum zero. Note 
that, in this example, it is not necessary to re- 
write any of the tests after the third since at this 
point it is obvious that none of the columns, other 
than the second, will have a sum of zero. This 
procedure must then be repeated for each of the 
elements. It can be seen that this algorithm is a 
valid method for determining adequacy by noting 
that it merely replaces the process of forming 
products to determine states by the process of 
addition. Thus, this checking procedure provides 
a rapid means for determining the adequacy of a 
set of tests when it is known that exactly one ele- 
ment is bad. 
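The column-sum check is easy to mechanize. The sketch below (all function names are hypothetical) orients each test so that the element being checked has a 0, adds the 1's in each column, and treats the element as isolable when its column is the only zero column. Elements are indexed from 0, so element 2 of the text is index 1.

```python
def complement(t: str) -> str:
    return "".join("1" if d == "0" else "0" for d in t)


def column_sums(tests: list[str], j: int) -> list[int]:
    """Rewrite each test (or its complement) with a 0 in position j,
    then add the 1's in each column."""
    sums = [0] * len(tests[0])
    for t in tests:
        oriented = t if t[j] == "0" else complement(t)
        for i, digit in enumerate(oriented):
            sums[i] += int(digit)
    return sums


def can_isolate(tests: list[str], j: int) -> bool:
    """Element j can be isolated iff column j is the only column summing to zero."""
    return [i for i, s in enumerate(column_sums(tests, j)) if s == 0] == [j]


def adequate(tests: list[str]) -> bool:
    """Adequate (exactly one bad element assumed) iff every element can be isolated."""
    return all(can_isolate(tests, j) for j in range(len(tests[0])))


if __name__ == "__main__":
    tests = ["10010", "01110", "00101", "10110", "10101", "11000"]
    print(column_sums(tests, 1))  # [4, 0, 4, 3, 4], as in the text
    print(adequate(tests))        # True
```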

An alternative representation of a sequential 
test procedure in terms of Boolean matrices is 
given in Johnson, et al. [7]. The representation 
is completely equivalent to the simplified sequen- 
tial testing diagram described above. As far as 
can be determined, the matrix representation does 
not provide any advantage over the testing dia- 
gram. 


Restriction of the Simplified Test Procedure 


Following the testing diagram defined above is 
a sufficient procedure for locating which element 
has failed provided the assumption that ‘‘only one 
element is bad’’ is satisfied. Consider now the 
restrictions and modifications which must be 




introduced if this assumption is not met. First consider the case where no elements are bad. In this case the result of each test will be ‘‘pass’’ and, with the convention indicated in Fig. 2, the upper branch from each test is followed. This path leads to a specific element $e_s$ (element b in the example of Fig. 4) which, under the assumption that one element is bad, is known to be bad. Note that the deduction that this element is bad is a consequence of the fact that all the other elements are known to be good as a result of the tests performed and the assumption that one element is bad. If this assumption is removed, the status of element $e_s$ is still questionable, since none of the preceding tests required that $e_s$ be good in order to pass. Before the remaining ambiguity can be removed, a test must be performed which will pass only when $e_s$ (or $e_s$ and other elements which are known to be good) is good. Thus the possibility that no elements are bad and the machine as a whole is good can be taken care of by a slight modification of the simplified testing diagram, which contains the implicit assumption that one element is bad. An illustration is given in Fig. 5. Note that in this case the test diagram has exactly N test vertices.

A second possible source of difficulty arises 
when more than one element is bad. Here, it is 
sufficient to insure that the presence of addition- 
al bad elements will not lead to the identification 
of any of the good elements as bad. If this con- 
dition is met, all the bad elements can be identi- 
fied by repeated application of the same testing 
procedure after repairing each bad element as it 
is identified. The restrictions imposed by this 
condition are established below. 

In the assignment of binary numbers to repre- 
sent the states in the testing diagram, certain of 


Fig. 5 —Modified simplified testing diagram including restrictions on the tests.




the 0’s have been inserted by deduction using the assumption that one element is bad and the failure of certain tests. Actually, the status of the elements corresponding to these 0’s is questionable, since the tests involved are tests of the remaining elements which are known (from the test results) to contain one bad element. The questionable status of the elements which have been labeled good by use of the assumption will not influence the results of the remaining tests provided that these elements are not required to be good in order that any of the remaining tests pass. In other words, when a failure has been localized to one group of elements, further tests should not involve elements which are not in this group. The restriction on the allowable tests is illustrated in Fig. 5. In this figure the 0’s which are assigned by making use of the ‘‘one element is bad’’ assumption are underlined, and the restriction on the following tests is indicated by underlining those 1’s in the test designation which ensure that these 0’s will not influence the results of the test. It is evident that the following condition is sufficient to ensure that the presence of more than one bad element will not interfere with the determination of one bad element:

Each test shall contain a 1 in the position corresponding to any 0 of the preceding state which has been determined as the result of the failure of a previous test.

The X’s appearing in the test designations of Fig. 5 indicate that the choice of 1 or 0 is completely immaterial, since the corresponding elements are known to be good as a result of previous tests. This may be summarized in the statement:

The digit of the test designation in the position corresponding to any 0 of the preceding state which has been determined as the result of the passing of a previous test is arbitrary.
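The two rules above can be tracked mechanically by recording which 0's of the state were produced by a failed test (the underlined 0's of Fig. 5). A minimal sketch with hypothetical names, using the digit-by-digit product rule; it applies the first test of Fig. 4 (taken here to be 11001, the designation implied by its pass state), lets it fail, and then checks which candidate next tests respect the restriction.

```python
def complement(t: str) -> str:
    return "".join("1" if d == "0" else "0" for d in t)


def next_state(state: str, underlined: frozenset, test: str, passed: bool):
    """Apply a test. 0's created by a failure are added to `underlined`
    because they rest on the assumption that one element is bad."""
    factor = test if passed else complement(test)
    new_state = "".join("1" if s == "1" and f == "1" else "0" for s, f in zip(state, factor))
    if passed:
        # 0's created by a pass are known good; their test digits become arbitrary (X).
        return new_state, underlined
    created = {i for i, (old, new) in enumerate(zip(state, new_state)) if old == "1" and new == "0"}
    return new_state, underlined | created


def respects_restriction(test: str, underlined: frozenset) -> bool:
    """Each test must have a 1 wherever the preceding state has an underlined 0."""
    return all(test[i] == "1" for i in underlined)


if __name__ == "__main__":
    state, under = next_state("11111", frozenset(), "11001", passed=False)
    print(state, sorted(under))                  # 00110 [0, 1, 4]
    print(respects_restriction("11011", under))  # True: examines c alone
    print(respects_restriction("01011", under))  # False: also involves element a
```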


The number of distinguishable tests is $2^N - 1$ in this more general case, since the assumption that one element is bad is no longer made.

Once again, not all of these tests will be possible, and a method is now established for determining if a given set of $M \le 2^N - 1$ tests is adequate to determine which elements, if any, of a given equipment are bad if the testing diagram is constructed according to the above rules. Such a set of




tests will be called ‘‘adequate in general.’’ 

Considering the situation when each test fails, it is seen that eventually a state consisting of (N - 2) 0’s and two 1’s is reached, and all of the 0’s will be underlined. In order to determine which of the two elements are bad, the next test must have 1’s in each of the underlined 0 positions of the preceding state and exactly one 1 in one of the remaining two positions. Thus, a necessary condition for a set of tests to be adequate in general is that the set must have at least one test which has (N - 1) 1’s and one 0. If this condition is satisfied, then it is possible to isolate the element corresponding to that 0 if it is bad. Hence, we can eliminate this element from each test, and we are left with a modified set of tests for N - 1 elements. In order for this modified set to be adequate in general it, too, must contain at least one test which has exactly one 0 in its designation. A repeated application of this check supplies a necessary and sufficient condition for a given set of tests to be adequate in general. For example, consider the set of tests,
example, consider the set of tests, 


Ti, 31 1001, 
Tee sl 00di1, 
Teak tO 1. 
Te00111, 


Since Ts has only one 0, and it is in the third 
position, element 3 can be eliminated, and a
modified set of tests formed. This set is: 


rere 110% 
Tesh i 
TH e=0011, 


Two additional applications of the check show
that the original set is adequate in general. Note 
that a set of tests may be adequate, but not ade- 
quate in general. For example, 


T,2=48100; 
Peo=0214% 
T; = 0001 


is a set which has this characteristic. 
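One reading of the repeated check, sketched below with hypothetical names: restrict every test to the elements not yet eliminated, look for a test with exactly one 0 among them, eliminate that element, and continue until no elements remain (the stopping rule is an assumption of this sketch). The example sets are illustrative only; they are not the sets printed above. A second helper applies the single-fault adequacy test (no two elements with identical columns) to show a set that is adequate but not adequate in general.

```python
def adequate_in_general(tests: list[str]) -> bool:
    """Repeated check: find a test with exactly one 0 among the elements not yet
    eliminated, eliminate that element, and repeat. Assumed here to succeed only
    when every element has been eliminated this way."""
    remaining = set(range(len(tests[0])))
    while remaining:
        for t in tests:
            zeros = [i for i in remaining if t[i] == "0"]
            if len(zeros) == 1:
                remaining.discard(zeros[0])
                break
        else:
            return False  # no test examines exactly one remaining element
    return True


def adequate_single_fault(tests: list[str]) -> bool:
    """With exactly one bad element, adequacy amounts to no two elements having
    identical columns, which is what the column-sum check detects."""
    columns = ["".join(t[i] for t in tests) for i in range(len(tests[0]))]
    return len(set(columns)) == len(columns)


if __name__ == "__main__":
    examples = {
        "each test examines one element": ["011", "101", "110"],
        "adequate, not adequate in general": ["1100", "1010", "0001"],
    }
    for name, tests in examples.items():
        print(name, adequate_single_fault(tests), adequate_in_general(tests))
```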


Probability of Failure 


The introduction of information states in the 
testing diagram permits a more detailed specifi- 
cation of the status of the equipment than the 
simple specification of which elements are good, 
bad, and questionable. If sufficient data from the 
maintenance history of the equipment is avail- 
able, the a priori probability that a failure is 
caused by a given element can be computed sta- 
tistically. If such data is not available, the a 
priori probability of failure of an element can be 
estimated from the reliability of the component 




parts of the element. In the absence of either of 
the above, an educated guess as to the a priori 
probabilities of failure may well give a better 
specification of the status of the machine than the 
simple specification of which elements are good, 
bad, or questionable. 

When the diagnostic problem is considered in 
terms of probability of failure, the influence of 
other factors can be introduced. The presence of one or more ‘‘symptoms’’ modifies, in effect, the
probabilities of failure associated with the initial 
state. Thus, any symptoms concerning the be- 
havior of the machine will influence the efficiency 
of a given testing procedure. If the symptoms in- 
dicate that element ‘a’ is almost certainly bad, 
any procedure which does not test element ‘a’ 
first will be inefficient in the presence of these 
symptoms. It should be emphasized that the cap- 
ability of isolating bad elements by a given pro- 
cedure is independent of any data on the proba- 
bility of failure. However, such data can be used 
in the comparison of the efficiency of various 
diagnostic procedures. 

The introduction of probability concepts also 
permits a generalization of the concept of a test. 
Previously, only tests which give unambiguous 
results were considered. Essentially this as- 
sumes that the testing equipment itself is not 
subject to failure or misinterpretation. Alter- 
natively, the test may be considered as an opera- 
tion which modifies the probabilities which speci- 
fy the status of the elements. With this concept 
of a test, it is possible to incorporate the proba- 
bility that the testing equipment is not perfectly 
reliable. This generalization is not considered 
in the remainder of this paper. 


Cost 


It is evident that there are a large number of 
possible testing procedures which can be defined 
for a given equipment. These procedures can 
only be compared if a relative rating which is in- 
dicative of the ‘‘cost’’ of the procedure is defined. 
Here cost is to be interpreted in the general 
sense so as to include the costs of man hours, 
test equipment, loss of equipment availability, 
etc. In the simplest model of a testing procedure, 
a cost can be associated with each of the possible 
tests. In this case, the cost of locating a particular bad element is the sum of the costs of the
tests along the path which leads from the initial 
state to the final state indicating that this element 
is bad. Note that in this simple model, the cost 
of a given sequence of tests is independent of the 
order in which the tests are performed. Consid- 
eration of actual equipments indicates that very 




often the cost of (i.e., time required for) per- 
forming a given set of tests varies greatly as the 
sequence is changed. However, in the present
paper, a cost is assigned to each possible test in- 
dependent of what tests have been done previ- 
ously. 


Optimization Criteria 


The most obvious criterion to apply in selection of the ‘‘best’’ of several diagnostic procedures is minimum average cost. In terms of the ideas previously introduced, the average cost C may be written as

$$C = \sum_{j=1}^{N} p_j Q_j,$$

where $p_j$ is the a priori probability of failure associated with the jth element, that is, the probability, before any diagnosis, that the failure of the equipment is caused by the failure of the jth element, and $Q_j$ is the cost of all of the tests which are necessary to isolate element j as bad in the diagnosis procedure under consideration, that is, the cost of determining that j is bad. If this cost is simply the sum of the costs of the tests, we may write

$$Q_j = \sum_{k=1}^{K_j} C_k^{\,j},$$

where $C_k^{\,j}$ is the cost of test k which appears in the path from the initial state to the final state indicating that element j is bad, and $K_j$ is the number of tests on that path. The minimum average cost criterion appears to be the most significant in most diagnostic procedures and is used almost exclusively in the remainder of this paper.
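The two formulas combine into a short computation. The sketch below assumes a diagram shaped like the Fig. 4 example as far as the text describes it (c, d and e isolated after two tests, a and b after three); the test names T1 to T4, their costs and the failure probabilities are made-up values for illustration, and the helper names are hypothetical.

```python
def path_costs(paths: dict[str, list[str]], test_cost: dict[str, float]) -> dict[str, float]:
    """Q_j: sum of the costs of the tests on the path that isolates element j."""
    return {j: sum(test_cost[t] for t in path) for j, path in paths.items()}


def average_cost(p: dict[str, float], q: dict[str, float]) -> float:
    """C = sum over elements of p_j * Q_j."""
    return sum(p[j] * q[j] for j in p)


if __name__ == "__main__":
    # Hypothetical data: the tests on the path leading to each final state.
    paths = {"a": ["T1", "T2", "T3"], "b": ["T1", "T2", "T3"], "e": ["T1", "T2"],
             "c": ["T1", "T4"], "d": ["T1", "T4"]}
    test_cost = {"T1": 1.0, "T2": 2.0, "T3": 1.0, "T4": 3.0}
    p = {"a": 0.3, "b": 0.2, "c": 0.1, "d": 0.1, "e": 0.3}
    q = path_costs(paths, test_cost)
    print(q)                   # {'a': 4.0, 'b': 4.0, 'e': 3.0, 'c': 4.0, 'd': 4.0}
    print(average_cost(p, q))  # about 3.7
```

Because each $Q_j$ is a plain sum, this simple model is blind to the order of the tests on a path, which is exactly the limitation the Cost discussion points out.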

Another possible criterion is the so-called min-max criterion. With this criterion, the diagnostic procedure for which the maximum cost (maximized over the elements) is less than for all other procedures is selected as the best. With reference to previous notation, the best procedure in the min-max sense is obtained by computing the maximum $Q_j$ for each procedure and selecting that procedure for which the maximum $Q_j$ so obtained is smallest. In formal mathematical language, the best procedure in the min-max sense has a (maximum) cost of

$$C_{\max} = \min_{r}\left(\max_{j} Q_j^{\,r}\right),$$

where $Q_j^{\,r}$ is the cost of determining that element j is bad in the rth procedure.
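To make the contrast between the two criteria concrete, the sketch below (hypothetical names and data) represents each candidate procedure r by its costs $Q_j^{\,r}$, with all tests of unit cost, and picks the min-max procedure. With the made-up probabilities shown, the procedure of minimum average cost is a different one.

```python
def min_max_choice(procedures: dict[str, dict[str, float]]) -> tuple[str, float]:
    """C_max = min over procedures r of (max over elements j of Q_j^r).
    Returns the chosen procedure and its worst-case cost."""
    worst = {r: max(q.values()) for r, q in procedures.items()}
    best = min(worst, key=worst.get)
    return best, worst[best]


def min_average_choice(procedures: dict[str, dict[str, float]], p: dict[str, float]) -> str:
    """Procedure with the smallest average cost sum_j p_j Q_j^r, for comparison."""
    average = {r: sum(p[j] * q[j] for j in p) for r, q in procedures.items()}
    return min(average, key=average.get)


if __name__ == "__main__":
    # Hypothetical procedures for four elements with unit-cost tests:
    # A isolates a in 1 test, b in 2, c and d in 3; B uses 2 tests for every element.
    procedures = {"A": {"a": 1, "b": 2, "c": 3, "d": 3},
                  "B": {"a": 2, "b": 2, "c": 2, "d": 2}}
    p = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}
    print(min_max_choice(procedures))         # ('B', 2)
    print(min_average_choice(procedures, p))  # A
```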




PROBLEM COMPLEXITY AND SOLUTION 
OF SPECIAL CASES 


In this section a detailed examination of the 
consequences of the definition of a test and the 
statement of the optimization problem is made. 
The work that follows is restricted by the assumption that only one element is bad. Some of
the properties of the testing diagram are exam- 
ined and the solution of special cases is indi- 
cated. 


Problem Complexity 


As a first project, it is desired to determine 
the number of testing diagrams that exist for 
various cases. First assume that there are N 
elements in the equipment, that exactly one element is bad, and that all of the $2^{N-1} - 1$ tests can be performed. The problem is to determine how many different testing diagrams can be constructed for this case. This number is designated as $\theta(N)$. By considering the number of diagrams possible for any $K < N$, and the number of ways in which K items can be picked from a group of N, a lower bound on $\theta(N)$ can be developed and is


$$\theta(N)_{\min} = \frac{1}{2}\sum_{K=1}^{N-1} \binom{N}{K}\,\theta(K)_{\min}\,\theta(N-K)_{\min}. \qquad (1)$$


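$\theta(N)_{\min}$ is plotted in Fig. 6 for values of N up to 8. A simpler form for $\theta(N)_{\min}$ can be obtained by rearranging (1) into the form

$$\frac{\theta(N)_{\min}}{N!} = \frac{1}{2}\sum_{K=1}^{N-1} \frac{\theta(K)_{\min}}{K!}\,\frac{\theta(N-K)_{\min}}{(N-K)!},$$

or

$$F(N) = \frac{1}{2}\sum_{K=1}^{N-1} F(K)\,F(N-K),$$

where

$$\theta(N)_{\min} = N!\,F(N).$$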


Fig. 6 contains a plot of F(N) vs N. This curve shows that $\log_{10} F(N)$ approaches a constant slope, indicating that F(N) behaves as $C_1\beta^N$ for large N. From Fig. 6, $\beta = 1.78$. Thus, $\theta(N)_{\min}$ behaves as $C_1 N!\,(1.78)^N$ for $N > 10$.
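Eq. (1), as reconstructed here with the factor 1/2, can be evaluated directly. The Python sketch below (the helper name `theta_min` is hypothetical) tabulates $\theta(N)_{\min}$, the ratio F(N+1)/F(N) that sets the slope of log F(N) in Fig. 6, and the exhaustive-search time for N = 10 at the 0.125 s per evaluation assumed later in the text.

```python
from fractions import Fraction
from math import comb, factorial


def theta_min(n_max: int) -> dict[int, int]:
    """Lower bound theta(N)_min of Eq. (1) for N = 1..n_max, with theta(1)_min = 1."""
    theta = {1: 1}
    for n in range(2, n_max + 1):
        total = sum(comb(n, k) * theta[k] * theta[n - k] for k in range(1, n))
        # Terms for K and N - K are equal and the middle binomial coefficient
        # is even, so halving is exact.
        assert total % 2 == 0
        theta[n] = total // 2
    return theta


if __name__ == "__main__":
    theta = theta_min(16)
    for n in range(1, 16):
        f_n = Fraction(theta[n], factorial(n))            # F(N) = theta(N)_min / N!
        f_next = Fraction(theta[n + 1], factorial(n + 1))
        print(n, theta[n], round(float(f_next / f_n), 3))  # growth ratio F(N+1)/F(N)
    # Exhaustive evaluation at 0.125 s per diagram:
    days = theta[10] * 0.125 / 86400
    print(theta[10], round(days, 1))
```

Under this reconstruction the values 1, 1, 3, 15, 105, ... coincide with (2N - 3)!!, and $\theta(10)_{\min}$ = 34,459,425, about $3.45 \times 10^7$. The ratio F(N+1)/F(N) works out to (2N - 1)/(N + 1): about 1.73 at N = 10 and 1.79 at N = 13, still creeping upward toward 2, so the 1.78 read from Fig. 6 is best taken as the slope over the plotted range. At 0.125 s per diagram the search takes about $4.3 \times 10^6$ s, roughly 50 days, a tenth of the 500 days quoted below; either figure makes exhaustive evaluation impractical.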


The computation of $\theta(N)$ includes, for any N,
all the distinct diagrams obtained by interchang- 






Fig. 6 —Some measures of the magnitude of the 
optimization problem. 


ing the elements on a given structure. The num- 
ber of different structures, without regard to the 
location of specific elements, also grows exponentially with N and is plotted as $\Gamma(N)$ in Fig. 6. In general, $\Gamma(N)$ behaves as $C_2(2.2)^N$ for large N.

In some cases examined below, attention will 
be focused on those structures which isolate ele- 
ments after different numbers of tests. That is, 
it is possible for two testing diagrams to have 
different structures, but each will isolate 2 ele- 
ments, say, in 2 tests and 4 more elements in 3 
tests. The number of such different testing diagrams is denoted as X(N). Fig. 6 contains a plot of X(N) vs N. For large values of N, log X(N) is very nearly a straight line, indicating that X(N) behaves as $C_2(1.84)^N$ for $N > 10$. While X(N) is considerably smaller than $\theta(N)$ for a given N, X(N) is still quite
large when it is noted that X(30) is approx- 
imately 9x 10°. 

The above results lead to the obvious conclu- 




sion that it would be impractical to attempt to 
evaluate the testing system having the minimum 
average cost by the computation of the average 
cost associated with each. For example, with N = 10, there exist about $3.44 \times 10^7$ testing diagrams. Assuming that each average cost can be computed in 0.125 second, this amounts to a total time of 500 days to compute the cost associated with each of the $\theta(10)$ testing procedures. Consequently, it is necessary to search for reasonable restrictions that can be imposed on this general problem, so that the restricted problems are amenable to solution by analytical means. These restrictions are the subject of the next section.


Solution of Special Cases 


Equal Cost—Equal Probability: The general 
problem studied in the previous section can be 
simplified considerably if certain special cases 
are considered. There exist in the literature
several proofs that the so-called ‘‘half-split’’ 
technique of testing results in the minimum aver- 
age cost when the probability of failure is the 
same for all elements and the costs of all tests 
are the same. The half-split technique implies 
that at each stage in the testing procedure, a 
test is made that separates the remaining ele- 
ments into two groups containing equal numbers 
of elements. It can also be shown [7] that the optimum testing procedure is somewhat more general than this. That is, if an equipment has N elements, where

$$N = 2^m + R$$

with $0 \le R < 2^m$, and it is known that exactly one element is bad, then the following testing procedure yields the optimum solution:


At each state in the sequence where there are $N' = 2^{m'} + R'$ questionable elements, choose any test which partitions the elements into two groups such that each group contains at least $2^{m'-1}$ questionable elements.


The repeated applications of this rule will yield the optimum testing procedure. Note that there are many different testing procedures that can yield the same result.* The average cost $C_0$


* This rule yields not only the testing procedure with 
the minimum average cost, but also the minimum of 
the maximum possible cost, regardless of which ele- 
ment has failed. It is also the min-max solution when 
the probabilities of failure are unequal. 




associated with the optimum testing procedure is 
$C_0 = (m + 2R/N)\,C_t$, where $C_t$ is the cost of a single test.
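A quick numerical check of the expression for $C_0$, in units of $C_t$: the sketch below (hypothetical helper names) builds a half-split procedure by always dividing the questionable group into parts of sizes floor(N/2) and ceil(N/2), which satisfies the partition rule above, and compares the resulting average number of tests with m + 2R/N.

```python
from fractions import Fraction


def half_split_total(n: int) -> int:
    """Sum, over all n elements, of the number of tests needed to isolate each
    one when every test divides the questionable group into parts of sizes
    floor(n/2) and ceil(n/2)."""
    if n == 1:
        return 0
    half = n // 2
    # Each of the n elements has this test on its path.
    return n + half_split_total(half) + half_split_total(n - half)


def optimum_average(n: int) -> Fraction:
    """m + 2R/N, where N = 2**m + R with 0 <= R < 2**m (in units of C_t)."""
    m = n.bit_length() - 1
    r = n - 2 ** m
    return m + Fraction(2 * r, n)


if __name__ == "__main__":
    for n in range(1, 17):
        print(n, Fraction(half_split_total(n), n), optimum_average(n))
```

For N = 5, for example, both columns give 12/5, i.e. $C_0 = 2.4\,C_t$.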

Equal Cost—Unequal Probability: This work is 
now generalized to include the situation where the 
costs of all tests are the same, but the probability
of failure of the elements may all be different. 
The problem that must be solved is to determine 
that testing procedure that yields the minimum 
average cost to locate the bad element. For the 
given N elements, and their probabilities of failure, there exist $\theta(N)$ testing diagrams. However, not all of these diagrams need to be considered in order to find the optimum. Note that if the element identification is removed from the blocks in the testing diagram, then the resulting structure is one of the $\Gamma(N)$ different structures possible. It is apparent that if two different structures require the same number of tests to isolate each of the elements, then the average cost associated with the two testing procedures will be the same. Consequently, not all of the $\Gamma(N)$ structures yield different average costs; only X(N) are different.

The problem to be solved now is the determi- 
nation of which of these X(N) structures yield 
the minimum average cost. One approach would 
be to compute the average cost associated with 
each structure, and choose the smallest. How- 
ever, an optimum coding problem as solved by 
Huffman [5] gives the solution directly. In order 
to show that this is true, it is necessary to draw 
the proper analogies between the two problems. 
One such set of analogies is: 


| Optimum Coding Problem | Optimum Testing Problem |
|---|---|
| 1) Message is sent. | 1) Element is bad. |
| 2) Message ensemble. | 2) Equipment. |
| 3) Two coding symbols. | 3) Two results from a test. |
| 4) Number of symbols in a message code. | 4) Number of tests to isolate an element. |
| 5) Time for transmission of a coded message directly proportional to the number of symbols associated with it. | 5) All tests of equal cost. |
| 6) Lowest possible average message length (optimum code). | 6) Lowest possible average cost to locate the bad element (optimum testing procedure). |


Using these analogies, Huffman’s method to de- 
termine the optimum code yields the optimum 
testing procedures. The steps in the procedure 




are presented below, without proof, as adapted 
from Fano [6]. 




Fig. 7 —Construction of the optimum number of tests. 


Step 1) Arrange the elements in order of de- 
creasing probability, as shown in Fig. 7. 

Step 2) Group together the 2 least probable ele- 
ments and compute the total probability 
of such a subset. 

Step 3) Obtain an auxiliary ensemble of elements 
from the original ensemble by consider- 
ing the subset of 2 elements formed in 
Step 2 as a single element with probabil- 
ity equal to the probability of the subset. 
Rearrange this auxiliary ensemble in 
order of decreasing probability as shown 
in Fig. 7. 

Step 4) Form successive auxiliary ensembles by 
repeating Step 2 and Step 3 until a single 
element of unity probability is left in the 
ensemble, as illustrated in Fig. 7. 

Step 5) The number of tests that must be con- 
ducted to isolate each element can be de- 
termined from the tree diagram that is 
obtained at the conclusion of Step 4. To 
obtain this number, trace the path from 
an element to the vertex of the tree. The 
number of times the element is combined 
with other elements is the required num- 
ber of tests. These results are listed in 
Fig. 7. 

Step 6) Any testing diagram which isolates the 
elements in the same number of tests as 
determined in Step 5 will have the lowest 
possible average cost. One such testing 
diagram is shown in Fig. 8 for the ex- 
ample of Fig. 7. The average cost for 
this example is: 


C=C; (018 x62 x24 0) 1x 3x3 4.005 
x4x2| 


(ero 


This is the lowest possible average cost.

This six-step procedure is the general solution




Fig. 8 —An optimum testing procedure.


for the problem where all tests have the same 
cost and all tests are possible. It thus includes 
as a special case the situation when the probabilities are all equal.

It is possible to exploit further the analogy be- 
tween the optimum coding problem and the diag- 
nosis problem. In terms of the coding problem, 
the following bounds exist for a binary code: 


$$H(X) \le N^* < 1 + H(X), \qquad (2)$$


where N* is the average number of symbols per 
message and 


$$H(X) = -\sum_{j=1}^{N} p_j \log_2 p_j.$$


Thus, by the analogy we have stated,

$$C_t\,H(X) \le C < C_t\,[1 + H(X)]. \qquad (3)$$


Eq. (3) gives upper and lower bounds on the 
average cost for this case. The importance of 
the lower bound is recognized by noting that if a 
testing procedure is devised under the constraint 
that not all tests are possible, then the average 
cost of this testing procedure can be compared 
with $C_t\,H(X)$ to determine just how much improvement would be possible if all tests were
available. 
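The bounds of Eq. (3) are simple to evaluate. The sketch below (hypothetical helper names) computes H(X) in bits and the corresponding cost bounds. The probabilities are made up; for them Huffman's procedure needs 1, 2, 3, 4 and 4 tests, an average of 2.15 tests. The cost of 2.6 $C_t$ for a procedure built from a restricted set of tests is likewise invented, to show how the lower bound caps the improvement that full test availability could bring.

```python
import math


def entropy_bits(probs: list[float]) -> float:
    """H(X) = -sum p_j log2 p_j (terms with p_j = 0 contribute nothing)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def cost_bounds(probs: list[float], c_t: float) -> tuple[float, float]:
    """Bounds of Eq. (3): C_t H(X) <= C < C_t [1 + H(X)]."""
    h = entropy_bits(probs)
    return c_t * h, c_t * (1 + h)


if __name__ == "__main__":
    probs = [0.40, 0.25, 0.15, 0.12, 0.08]  # hypothetical probabilities
    low, high = cost_bounds(probs, c_t=1.0)
    constrained_cost = 2.6                  # hypothetical restricted-test procedure
    print(round(low, 3), round(high, 3))
    print("improvement available at most", round(constrained_cost - low, 3))
```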


CONCLUSIONS 


The primary result of the present paper is 
the introduction of several general concepts 
which are applicable to the problem of diagnosis 
of equipment failures. Within the framework of 
the definitions given for an equipment, an ele- 
ment and the model of a test, it is possible to 
define and construct a diagram of a testing pro- 
cedure. While the diagram has only been used 
here for tests which have only two possible re- 
sults (pass or fail), the generalization to more 
detailed tests which may have several possible 
results is apparent. It is shown that a com- 




pletely general testing diagram becomes quite 
complicated even when the equipment under con- 
sideration is not complicated. However, a major 
simplification is obtained by introducing a sim- 
plified diagram with suitably restricted tests. 
This simplified testing diagram may be used re- 
peatedly in order to find all the bad elements of 
the equipment. 

In terms of the testing diagram, it is possible 
to compute the minimum average cost of diagnos- 
ing the equipment, and this appears to be the 
most useful measure of the efficiency of a test 
procedure. Specific solutions for optimum pro- 
cedures are included here for two special cases. 
However, the determination of optimum procedures for the more general case, where the costs
of tests may be unequal and only a limited set of 
tests are available, remains unsolved. A com- 
panion paper (see footnote 2, page 26) presents a 
method, using information theory concepts, which 
leads to near optimum testing diagrams for the 
general problem. The combination of the solu- 
tions for special cases and the information theory 
method provides a useful guide for constructing 
efficient test procedures. 

It is believed that the concepts and techniques 
developed here are applicable to the diagnostic 
procedures for a wide class of equipments; how- 
ever, the ultimate evaluation can only be made 
after applications have been attempted for sev- 
eral different types of equipment. One such 
application is currently being attempted by the 
authors. 


REFERENCES 


[1] A. J. Hoehn and E. Saltz, ‘‘Mathematical models for determination of efficient troubleshooting routes,’’ IRE TRANS. ON RELIABILITY AND QUALITY CONTROL, no. PGRQC-13, pp. 1-14; July, 1958.

[2] R. B. Miller, J. D. Foley, Jr., and R. P. Smith, ‘‘Systematic trouble-shooting and the half-split technique,’’ Human Resources Res. Cen., Lackland Air Force Base, Tech. Rept. 53-21; July, 1953.

[3] L. M. Stolurow, G. Bergum, T. Hodgson, and J. Silve, ‘‘The efficient course of action in ‘troubleshooting’ as a joint function of probability and cost,’’ Educ. and Psychol. Meas., vol. 15, pp. 462-477; 1955.

[4] ‘‘Synchro troubleshooting,’’ Electronic Ind.; June, 1958.

[5] D. A. Huffman, ‘‘A method for the construction of minimum redundancy codes,’’ PROC. IRE, vol. 40, pp. 1098-1101; September, 1952.

[6] R. M. Fano, ‘‘Notes for Subject 6.574, Statistical Theory of Information,’’ M.I.T., Cambridge, Mass., pp. III-10 to III-14; 1954.

[7] R. A. Johnson, E. Kletsky, and J. D. Brulé, ‘‘Diagnosis of equipment failures,’’ SURI, Rept. No. EE 977-594T1; April, 1959. AD-213876.




AN INFORMATION THEORY APPROACH TO DIAGNOSIS* 


R. A. JOHNSON†


ABSTRACT 


In the preceding paper, a model of a sequential 
diagnostic test procedure is developed for appli- 
cation to fault location in electronic equipment. 
The average cost of diagnosis is defined and the 
problem of finding procedures of minimum aver- 
age cost is solved for two special cases. In the 
present paper, the ratio of the average informa- 


* In order to maintain continuity of the preceding paper by J. D. Brulé, R. A. Johnson, and E. J. Kletsky, ‘‘Diagnosis of Equipment Failures,’’ this issue, p. 23, the above abstract is presented here. This paper, ‘‘An Information Theory Approach to Diagnosis,’’ will be provided at the Sixth National Symposium on Reliability and Quality Control, January 11-13, 1960, and will be available in the Proceedings of that Symposium.

† Elec. Engrg. Dept., Syracuse University, Syracuse, N. Y.


tion gained by performing a given test to the cost 
of the test is introduced as a figure of merit for 
the test. Repeated application of this figure of 
merit to choose successive tests results in a
systematic way of constructing efficient sequen- 
tial test procedures. The procedures so con- 
structed are compared with known optimum pro- 
cedures for several special cases as a means of 
evaluating the information theory approach. For 
some special cases the true optimum is obtained; 
in others, the information theory test procedures 
are only slightly less efficient than the optimum. 
The advantage of the information theory approach 
lies in the fact that it results in a simple system- 
atic way of constructing efficient test procedures 
even in the general case for which the true opti- 
mum solution has not been obtained by other 
means. 


RELIABILITY OF PARALLEL ELECTRONIC COMPONENTS 


H. WALTER PRICE†


Summary—Electronic components are frequently connected in parallel as a measure to increase reliability. Whether such a parallel connection increases or decreases reliability, and by how much, is a function of the open-circuit failure probability and the short-circuit failure probability. Equations are derived which permit a determination of the increase or decrease of reliability when components are connected in parallel. Some curves are included to aid the circuit designer in this determination.


†Reliability Branch, Diamond Ordnance Fuze Labs., Washington 25, D. C.


n COMPONENTS CONNECTED IN PARALLEL 


Electronic components may fail catastrophi- 
cally by open-circuiting or by short-circuiting. 
A given component, then, will have a probability 
of failure by open-circuiting and a probability of 
failure by short-circuiting. The total probability 
of failure of such a component is given by the 
addition law of probability. Since, in general, a 
component cannot fail simultaneously by open- 
circuiting and by short-circuiting, the mutually 
exclusive event form of the law of addition of 
probability applies. Thus, 


$$q = r + s \qquad (1)$$




where q = total probability of failure 
r = probability of open-circuiting 
s = probability of short-circuiting 




Let n components be connected in parallel. These components have open-circuit probabilities of

$$r_1, r_2, r_3, \ldots, r_n$$

and short-circuit probabilities of

$$s_1, s_2, s_3, \ldots, s_n.$$


Let it be assumed that the failure probabilities 
are statistically independent (i.e., the failure of
one component in no way affects the probability 
of failure of the other components). 

Consider, first, the case when 


$$s_1 = s_2 = s_3 = \cdots = s_n = 0.$$


Since, in this case, the components can fail only by open-circuiting, the failure of one component does not constitute a total circuit failure. In fact, all components must fail to constitute a total circuit failure. The probability of total circuit failure, then, is the joint probability that all n components open-circuit, which, because the failures are independent, is the product of the individual open-circuit probabilities. Thus,

$$q_n = r_1 r_2 r_3 \cdots r_n = \prod_{i=1}^{n} r_i. \qquad (2)$$


Next, consider the case when 
$$r_1 = r_2 = r_3 = \cdots = r_n = 0.$$


Since, in this case, the components can fail only by short-circuiting, the failure of any one constitutes a total circuit failure. The probability of total circuit failure, then, is given by the law of addition of probability for n events; for independent events this equals one minus the probability that no component short-circuits. Thus,

$$q_n = 1 - \prod_{i=1}^{n} (1 - s_i). \qquad (3)$$


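In the general case, where the components can fail either by open-circuiting or by short-circuiting, the probability of total circuit failure is given by the sum of (2) and (3). The two probabilities may be added because the events are mutually exclusive: if every component has open-circuited, none can have short-circuited. Thus,

$$q_n = \prod_{i=1}^{n} r_i + 1 - \prod_{i=1}^{n} (1 - s_i). \qquad (4)$$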
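As a numerical check of (4), the Python sketch below evaluates the formula for components that need not be identical and compares it with a direct simulation in which every component independently opens with probability r_i, shorts with probability s_i, or survives. The function names are illustrative, and `math.prod` requires Python 3.8 or later.

```python
import math
import random
from typing import Sequence


def parallel_failure_probability(r: Sequence[float], s: Sequence[float]) -> float:
    """Eq. (4): q_n = prod(r_i) + 1 - prod(1 - s_i)."""
    all_open = math.prod(r)                            # eq. (2): every component open-circuits
    any_short = 1.0 - math.prod(1.0 - si for si in s)  # eq. (3): at least one short-circuits
    return all_open + any_short                        # mutually exclusive, so the terms add


def simulate(r: Sequence[float], s: Sequence[float], trials: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo check: draw each component's state independently and count circuit failures."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        states = []
        for ri, si in zip(r, s):
            u = rng.random()
            states.append("open" if u < ri else ("short" if u < ri + si else "good"))
        if "short" in states or all(state == "open" for state in states):
            failures += 1
    return failures / trials


r, s = [0.2, 0.3], [0.05, 0.1]
print(parallel_failure_probability(r, s))  # about 0.205
print(simulate(r, s))                      # close to the same value
```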


If the components which are connected in parallel are identical-type components, then

$$r_1 = r_2 = \cdots = r_n = r, \qquad s_1 = s_2 = \cdots = s_n = s,$$


and (4) reduces to

$$q_n = r^n + 1 - (1 - s)^n \qquad (5)$$

where q_n = resultant failure probability of identical-type components connected in parallel
n = number of such components in parallel
r = open-circuit failure probability of each such component
s = short-circuit failure probability of each such component.


TWO COMPONENTS CONNECTED 
IN PARALLEL 


The case of two identical-type components 
connected in parallel is important because of its 
common occurrence. For this case, (5) reduces to

$$q_2 = r^2 + 2s - s^2. \qquad (6)$$


It is of interest to know the amount of improvement in reliability obtained by connecting the two components in parallel. An improvement in reliability is equivalent to a reduction in the probability of failure. Let there be defined, then, a failure probability reduction factor I_n, relating the failure probability of n components in parallel to the failure probability of a single unit. Thus,

$$I_n = \frac{q_n}{q_1} \qquad (7)$$

or, from (1),

$$I_n = \frac{q_n}{r + s}. \qquad (8)$$

It should be noted that when I_n < 1, the failure probability is reduced by the parallel connection. Conversely, when I_n > 1, the failure probability is increased by the parallel connection. Likewise, when I_n = 1, the failure probability is unchanged by the parallel connection.

It is obvious, then, that in order to improve reliability the designer should connect the components in parallel only when I_n < 1.

Since q_2 of (6) and q_1 of (1) are functions of r and s, I_2 must be a function of r and s. In the two-component case being considered, (8) can be written

$$I_2 = \frac{q_2}{q_1} = \frac{r^2 + 2s - s^2}{r + s}. \qquad (9)$$


Let a failure ratio factor η be defined as

$$\eta = \frac{\text{short-circuit probability}}{\text{open-circuit probability}} = \frac{s}{r}. \qquad (10)$$


Then (9) can be written as

$$I_2 = \frac{r(1 - \eta^2) + 2\eta}{1 + \eta}, \qquad (11)$$

such that I_2 is a function of the open-circuit probability r and the ratio η of the short-circuit to open-circuit probabilities.

Fig. 1 is a chart for use by a circuit designer to determine the change in probability of failure caused by connecting two identical-type components in parallel. This chart was computed from (11) and gives I_2 as a function of r with η as a parameter.

In Fig. 1 it is to be noted that

$$I_2 = 1 \quad \text{when} \quad \eta = 1$$

for all values of r. Thus, if the short-circuit probability is equal to the open-circuit probability, the total failure probability of the two components in parallel is identical to the failure probability of a single unit. Hence, the curve for η = 1 has been labeled the "break-even" line in Fig. 1.

From Fig. 1 and (11) it is obvious that

$$I_2 < 1 \text{ when } \eta < 1, \qquad I_2 = 1 \text{ when } \eta = 1, \qquad I_2 > 1 \text{ when } \eta > 1$$

for all values of r [subject to the restrictions 0 < r < 1, 0 < (r + s) < 1]. Therefore, to determine if a reliability improvement can be realized by connecting two components in parallel, it is only necessary to determine whether η < 1. The magnitude of such an improvement (or degradation) can then be determined with the aid of Fig. 1.

[Fig. 1 chart: failure probability reduction factor I_2 versus open-circuit probability r, with η = (short-circuit probability)/(open-circuit probability) as the parameter; the η = 1 curve is the "break-even" line. Example: r = .01, s = .001 gives q_1 = .011, I_2 = .19, q_2 ≈ .0021.]
Fig. 1—Failure probability reduction factor (I_2) for 2 identical-type components in parallel.


THREE COMPONENTS IN PARALLEL


For the case of three identical-type components, (5) reduces to

$$q_3 = r^3 + 3s - 3s^2 + s^3 \qquad (12)$$

and (8) reduces to

$$I_3 = \frac{r^3 + 3s - 3s^2 + s^3}{r + s}. \qquad (13)$$

Eq. (13) can be written in terms of η from (10) as

$$I_3 = \frac{r^2(1 + \eta^3) - 3\eta^2 r + 3\eta}{1 + \eta}. \qquad (14)$$

Fig. 2 is a chart for use by a circuit designer to determine the change in probability of failure caused by connecting three identical-type components in parallel. This chart was computed from (5) and gives I_3 as a function of r with η as a parameter.

In Fig. 2 it can be seen that

$$I_3 \approx 1 \quad \text{when} \quad \eta = \tfrac{1}{2}$$

for most values of r. Therefore, the curve η = ½ has been labeled the "break-even" line in Fig. 2.


It is obvious, then, that for three components in parallel

$$I_3 < 1 \text{ when } \eta < \tfrac{1}{2}, \qquad I_3 \approx 1 \text{ when } \eta = \tfrac{1}{2}, \qquad I_3 > 1 \text{ when } \eta > \tfrac{1}{2}$$

for most values of r [subject to the restrictions 0 < r < 1, 0 < (r + s) < 1].

As in the two-component case, it is only necessary to examine the value of η to determine if a reliability improvement can be made by connecting three components in parallel.
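The break-even statements can be checked numerically from (15) with s = ηr. For two components the sign of I_2 − 1 follows from the factorization I_2 − 1 = (1 − η)[r − 1/(1 + η)], where r < 1/(1 + η) is exactly the restriction r + s < 1; for three components and small r, (14) approaches 3η/(1 + η), which equals 1 at η = ½. A minimal Python sketch (the function name `reduction_factor` is ours):

```python
def reduction_factor(r: float, eta: float, n: int) -> float:
    """I_n of eq. (15), written with s = eta * r as in (10)."""
    s = eta * r
    if not (0.0 < r < 1.0 and 0.0 < r + s < 1.0):
        raise ValueError("requires 0 < r < 1 and 0 < r + s < 1")
    return (r**n + 1.0 - (1.0 - s) ** n) / (r + s)


# Two components: break-even at eta = 1 for every r.
for r in (0.001, 0.01, 0.1, 0.3):
    print(r, reduction_factor(r, 1.0, 2))  # 1.0 up to rounding
# Three components: close to break-even at eta = 1/2 when r is small.
print(reduction_factor(0.01, 0.5, 3))      # about 0.995
# r = .01, s = .001 (eta = 0.1), the case used in the Fig. 1 and Fig. 2 examples:
print(reduction_factor(0.01, 0.1, 2), reduction_factor(0.01, 0.1, 3))  # about 0.19 and 0.27
```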


OPTIMUM NUMBER OF PARALLEL 
COMPONENTS 


In Figs. 1 and 2, it can be seen that 3 components in parallel result in a lower value of I_n than 2 components for certain combinations of values of r and η. However, 2 components in parallel result in a lower value of I_n than 3 components over a much larger range of combinations of r and η. It follows, then, that for any particular combination of values of r and η there is an optimum number of components connected in parallel, which will result in the lowest value of I_n and, hence, in the highest-reliability circuit.

Consider, then, that there are regions, related to certain combinations of values of r and η, where 2 components in parallel constitute the most reliable circuit. Similarly, there are regions where 3 components in parallel constitute the most reliable circuit, etc.

The regions corresponding to the different optimum numbers of components are adjacent to each other, such that the region corresponding to 2 components lies between the regions corresponding to 1 component and 3 components, etc. Two adjacent regions are separated by a border line.

To obtain the location of these border lines, consider the general form for I_n:

$$I_n = \frac{r^n + 1 - (1 - s)^n}{r + s}. \qquad (15)$$

To determine the border line between the 1-component and 2-component regions, set

$$I_1 = I_2.$$

Solving for η, it can be determined that I_1 = I_2 when η = 1 for all values of r.

Likewise, to determine the border line between the 2-component and 3-component regions, set

$$I_2 = I_3.$$

This results in an equation in η and r which is the equation of the border line separating the regions.

[Fig. 2 chart: failure probability reduction factor I_3 versus open-circuit probability r, with η = (short-circuit probability)/(open-circuit probability) as the parameter; the "break-even" line is marked. Example: r = .01, s = .001 gives q_1 = .011, I_3 = .27, q_3 ≈ .0030.]
Fig. 2—Failure probability reduction factor (I_3) for 3 identical-type components in parallel.
Fig. 3 shows the results of computing the border lines for each of the regions up to 6 components in parallel. It is to be noted that 1 component is the most reliable for all η > 1. It is to be further noted that 2 components in parallel result in the most reliable circuit for most practical values of η and r for η < 1.
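Setting I_n = I_{n+1} in (15) and cancelling the common denominator r + s gives the border condition r^n(1 − r) = s(1 − s)^n. For n = 1 its root in the admissible range is s = r, i.e., η = 1, as stated above; for small r the border between n and n + 1 components lies near η ≈ r^(n−1). The Python sketch below solves the condition by bisection for a chosen r. It illustrates how border lines like those of Fig. 3 could be computed and is not the author's procedure; `border_eta` is a hypothetical name.

```python
def border_eta(r: float, n: int, iterations: int = 200) -> float:
    """eta on the border line between the n- and (n+1)-component regions.

    Setting I_n = I_(n+1) in (15) and cancelling the common denominator r + s gives
    r**n * (1 - r) = s * (1 - s)**n. The right side increases on 0 < s < 1/(n + 1),
    so the root there is found by bisection and returned as eta = s / r.
    """
    target = r**n * (1.0 - r)
    lo, hi = 0.0, 1.0 / (n + 1)
    if target > hi * (1.0 - hi) ** n:
        raise ValueError("no border line on the increasing branch for this r")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid * (1.0 - mid) ** n < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / r


# Border values of eta at r = .01 between n and n + 1 components, n = 1..5.
for n in range(1, 6):
    print(n, n + 1, border_eta(0.01, n))
```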


CIRCUIT DESIGN PROCEDURE 


1) Obtain values (or estimates) for s and r.
2) Use Fig. 3 to determine the most reliable number of components to place in parallel.
3) Use Fig. 1 or 2, or

$$I_n = \frac{r^n + 1 - (1 - s)^n}{r + s},$$

as may be applicable, to determine the reduction in failure probability.
4) Use

$$q_n = I_n q_1 = I_n(r + s) = I_n(r + \eta r)$$

to determine the total failure probability of the parallel components.

[Fig. 3 chart: border lines between the regions in which 1, 2, ..., 6 components in parallel give the most reliable circuit, plotted against open-circuit probability.]
Fig. 3—Optimum number of parallel components.
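Steps 2 through 4 can also be carried out numerically instead of from the charts. The Python sketch below assumes identical-type components and searches n = 1 to 6, the range covered by Fig. 3; the function names are illustrative. It evaluates 1 − (1 − s)^n with `math.expm1` and `math.log1p` so that very small s does not lose precision.

```python
import math
from typing import Tuple


def failure_probability(r: float, s: float, n: int) -> float:
    """Eq. (5), q_n = r**n + 1 - (1 - s)**n, with 1 - (1 - s)**n computed via expm1/log1p."""
    return r**n - math.expm1(n * math.log1p(-s))


def design(r: float, s: float, n_max: int = 6) -> Tuple[int, float, float]:
    """Steps 2-4 done numerically: returns (n, I_n, q_n) for the best n in 1..n_max.

    Since q_1 = r + s is fixed, minimizing q_n is the same as minimizing I_n.
    """
    n_best = min(range(1, n_max + 1), key=lambda n: failure_probability(r, s, n))
    q_n = failure_probability(r, s, n_best)
    return n_best, q_n / (r + s), q_n


print(design(0.01, 0.001))    # n = 2, I_2 about 0.19, q_2 about 0.0021
print(design(0.01, 0.02))     # n = 1: eta = 2 > 1, so no parallel connection
print(design(0.01, 0.00001))  # n = 3 for this very small eta
```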




EVALUATION AND PREDICTION OF CIRCUIT PERFORMANCE 
BY STATISTICAL TECHNIQUES 


J. MARINI* and R. WILLIAMS†


Summary—A method is described for predict- 
ing circuit performance to the extent that it is 
dependent on part performance. The basis for the 
prediction is the performance of parts as meas- 
ured at test points fixed by the part specifications. 
Implicit in this method is the assumption that the 
distribution of part performance at the test points 
can be predicted from consideration of the speci- 
fications. Such an assumption is necessary to 
any attempted prediction of this nature. 

An empirical equation giving circuit perfor- 
mance in terms of part performance as measured 
at the test points is assumed. The exact form of 
the equation is determined experimentally, by 
means of regression analysis of data consisting of 
sets of measurements of breadboard models of 
the circuit. The empirical equation is then used 
mathematically to calculate the distribution of the 
circuit performance from the assumed distribu- 
tions of part performance. 

The method has been applied successfully to 
predict the laboratory performance of an ac 
amplifier and a telemetering oscillator. In 
principle, the method can be extended to the pre- 
diction of equipment or system performance. 


I. INTRODUCTION 


The evaluation of the design of a circuit in- 
tended for mass production can be regarded as a 
problem in prediction. The conclusion that a 
given circuit design will be satisfactory actually 
amounts to a prediction that the performance of 
most of the circuits to be produced under the de- 
sign will fall within specified limits. In this 
sense, the preproduction evaluation of a circuit 
design can be regarded as a statistical prediction 
of the distribution of circuit performance. 

In practice, this prediction usually is based to a considerable extent on observations of the performance of a number of preproduction models of the circuit. Occasionally, the performance may be observed when "limit" tubes or parts (that is, tubes or parts whose characteristics barely meet specification limits) are used. However, merely to observe the performance of the models without simultaneously relating this performance to the properties of the particular parts which are used to construct the models can be misleading. The use of limit tubes and parts is a crude attempt to avoid this danger.

*Electromagnetic Res. Corp., Washington, D. C.; formerly at Arinc Res. Corp., Washington, D. C.
†Arinc Res. Corp., Washington, D. C.

The purpose of this paper is to describe a 
systematic, quantitative method of predicting the 
initial performance of an electronic device on the 
basis of the initial performance of its parts, 
through use of the statistical technique of mul- 
tiple-regression analysis. An important feature 
of this method is that it provides a means of 
bridging the gap between circuit performance and 
specification-controlled part performance. 

The method is demonstrated by means of the 
following problem: an oscillator which is to be 
produced in quantity has been designed to operate 
at a certain frequency. It is known that, when a 
number of these oscillators are constructed, the 
output frequencies will be distributed over a 
range of values. The variations from the desired 
value will be due principally to variations in the 
performance characteristics of the parts used in 
making the oscillators. If the design is good, the 
mean of the distribution will fall near the design- 
center frequency. However, unless it is practical 
to adjust the frequencies of individual oscillators, 
it may be necessary to reject those in which the 
deviations from design center are too large.
Therefore, it would be highly desirable to be able 
to predict what the probability distribution of the 
output will be when the oscillators are manufac- 
tured in large numbers. This would make it pos- 
sible to determine in advance what percentage of 
the oscillators would be acceptable. 


Formulation of the Problem 


In the method reported here, it is assumed that 
the performance of a circuit can be expressed in 
terms of one or more measurable variables, 
which are termed "circuit characteristics," for
example, the frequency and power output of an 
oscillator. In order to simplify this exposition, 
only one characteristic of the oscillator circuit— 
frequency—will be considered. However, the 
methods used can be extended to handle more than 
one characteristic. 

It is also assumed that the performance of
parts can be expressed in terms of measurable
variables called part characteristics, e.g., the
resistance of a resistor and the transconductance 
of an electron tube. In this connection, however, 
it is important to observe that the value obtained 
for a given part characteristic depends on the 
conditions under which it is measured. The term 
"operating part characteristic" will be used here
to designate a characteristic measured under
actual circuit operating conditions, while the term
"specification part characteristic" will be used
to signify a characteristic measured under the 
conditions specified in the part-specification 
sheet. Unless the conditions under which a char- 
acteristic is measured in the circuit are identical 
with those stipulated in the specification, the op- 
erating part value and the specification part value 
of a measured characteristic are most probably 
different. 

The fact that operating part characteristics 
and specification part characteristics are not 
identical gives rise to some difficulties. Design 
handbooks ordinarily give equations for circuit 
characteristics in terms of operating part char- 
acteristics.¹ Therefore, if a designer wishes to
predict the distribution of the circuit-character- 
istic values which will result from his design, he 
must have information concerning the distribution 
of the operating characteristics of the parts used 
in his design. At best, however, only nominal or 
design-center values of the operating part char- 
acteristics will be provided by the manufacturer. 
The only good source of information about the 
distribution of part characteristics is the part 
specification; but, to make use of these specifica- 
tions, it is necessary to bridge the gap between 
the circuit characteristic and the specification 
part characteristic. 

Stated in mathematical terms, then, the prob- 
lem under consideration is prediction of the prob- 
ability distribution, p(Y), of a circuit character- 
istic, Y, where Y is considered to be a func- 
tion of the specification part characteristics. 


¹For example, the gain at resonance of a loaded tuned
amplifier may be given as a function of the transconduc- 
tance of the tube and the resistance of the loading resis- 
tor. However, the value of transconductance used in the 
equation is measured at the operating point of the tube as 
it is used in the circuit, and this, in general, is different 
from the value that would be obtained through measure- 
ment at the specification test point. Similarly, the re- 
sistance value given in the equation is the equivalent shunt 
resistance of the resistor as measured at the operating 
frequency of the circuit. If the frequency is high, this 
value will differ from the dc resistance prescribed in the 
specifications. 




Actually, the method described in this paper is used to predict, not p(Y), but only the mean, μ, and the variance, σ², of the distribution of Y. Fortunately, this is not a very severe limitation, because, where the distribution of Y is normal, the exact probability distribution is determined by these quantities, and where it is not normal, a great deal of information is provided by them. It is known that at least 95 per cent of the population of Y must lie within the limits μ ± 2σ when Y is normally distributed, and it can be shown that at least 95 per cent of the population of Y must lie within the limits μ ± 4.5σ, no matter what the shape of the distribution happens to be.²
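Both coverage figures quoted above can be checked in a few lines of Python: Tchebycheff's inequality guarantees that at least 1 − 1/k² of any distribution lies within μ ± kσ, and for a normal distribution the fraction within μ ± kσ is erf(k/√2).

```python
import math

coverage = 0.95
# Tchebycheff: P(|Y - mu| >= k*sigma) <= 1/k**2, so at least 95 per cent of any
# distribution lies within mu +/- k*sigma once 1/k**2 <= 0.05.
k_any = 1.0 / math.sqrt(1.0 - coverage)
# For a normal Y, the fraction within mu +/- k*sigma is erf(k / sqrt(2)).
normal_within_2sigma = math.erf(2.0 / math.sqrt(2.0))
print(f"{k_any:.2f}")                 # 4.47, which the text rounds up to 4.5
print(f"{normal_within_2sigma:.4f}")  # 0.9545
```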

Illustrative example: The method used in this 
paper has much in common with a technique de- 
scribed in a Wright Air Development Center 
Report³ for predicting tolerance limits on the
output of a circuit whose performance depends on 
only a single specification part characteristic. 
The example given in the report is ideal for illus- 
trating the basic ideas underlying the method used 
here, and is reproduced below. The circuit char- 
acteristic of interest was the output current of an 
electron tube as measured in the circuit, and the 
only specification part characteristic of import- 
ance was the plate current of the tube at a test 
point fixed by the specifications. When a number 
of tubes were measured at the specification test 
point, and then inserted consecutively into the 
circuit, the data shown in Table I were obtained. 
These data are plotted in Fig. 1. The straight 
line drawn in Fig. 1 is determined by the least- 
squares solution of the equation 


$$I_c = b_0 + b_1 I_s + \epsilon$$

where I_c is the tube output current as measured in the circuit, and I_s is the current measured at the specification test point. The vertical scatter of the points about this line corresponds to the error term, ε.
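The single-variable technique can be sketched in Python as follows: fit b_0 and b_1 by least squares, then carry the specification limits on I_s through the fitted line to obtain a predicted minimum and maximum output, in the spirit of Fig. 1. The data below are made up for illustration (they are not Table I), the specification limits 5.0 and 10.0 mA are placeholders because the MIL-E-1 values are not given here, and the function names are hypothetical. The sketch ignores the scatter ε about the line, so its limits apply to the line only.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares estimates (b0, b1) for y = b0 + b1*x + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def predicted_limits(b0: float, b1: float, spec_min: float, spec_max: float) -> Tuple[float, float]:
    """Carry the specification limits on I_s through the fitted line (scatter ignored)."""
    ends = (b0 + b1 * spec_min, b0 + b1 * spec_max)
    return min(ends), max(ends)


# Made-up (I_s, I_c) pairs in milliamperes; these are NOT the Table I data.
i_s = [6.0, 7.0, 8.0, 9.0]
i_c = [3.5, 4.0, 4.5, 5.0]
b0, b1 = fit_line(i_s, i_c)
print(b0, b1)                               # 0.5 0.5
print(predicted_limits(b0, b1, 5.0, 10.0))  # (3.0, 5.5)
```

Choosing test tubes that span the whole specification range, as the following paragraph recommends, keeps both limits inside the fitted data and avoids extrapolation.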

Fig. 1 suggests a number of considerations involved in the use of this type of method. First, the distribution of test-point plate currents of the sample of tubes used does not represent very well the distribution allowed under the specification. Second, the illustration involves extrapolation, although this would not have been necessary if tubes had been selected to cover the entire permissible range of variation of test-point current. Third, the fact that the scatter of the points about the least-squares line is small makes it seem likely that the output current is indeed determined by the specifications.

TABLE I
EXPERIMENTAL DATA ON ELECTRON-TUBE OUTPUT

Tube      Plate Current      Circuit Current
Number    (milliamperes)     (milliamperes)
            8.10               4.54

²A. Hald, "Statistical Theory with Engineering Applications," John Wiley and Sons, Inc., New York, N. Y., pp. 109-110; 1952. (Tchebycheff's Inequality.)
³ author, "Techniques for Application of Electron Tubes in Military Equipment," Wright Air Dev. Center, Dayton, Ohio, Tech. Rept. 55-1; October, 1955.


[Fig. 1 plot: output current I_c in milliamperes (measured in the circuit) versus plate current I_s in milliamperes (measured under specified MIL-E-1 conditions), with the MIL-E-1 minimum and the predicted minimum and maximum output marked.]
Fig. 1—Circuit output vs specification current, showing regression line used for prediction.


[Fig. 2 plot: output current I_c in milliamperes (measured in the circuit) versus plate current I_s in milliamperes (measured under specified MIL-E-1 conditions), with the MIL-E-1 minimum and maximum marked.]
Fig. 2—Circuit output vs specification current—not suitable for prediction.


Fig. 2 was drawn to illustrate the fallacy of 
merely measuring the circuit characteristic, 
without analyzing the relationship between it and 
the specification part characteristics. The re- 
sults shown in Fig. 2 might conceivably have been 
obtained on the same circuit that provided the 
data shown in Fig. 1. In Fig. 2, it is evident that 
the output is not controlled by the specification 
part characteristic selected, and that further 
study is required to determine the source of the 
variation of the circuit characteristic. If the 
circuit characteristic is dependent on part char- 
acteristics not controlled by the specifications, 
there is always the possibility of large shifts in 
the distribution of the circuit characteristic when 
different lots of parts are used. 

This possibility exists, although to a lesser 
degree, even when good correlation is obtained. 
For example, a circuit may be critical with re- 
spect to grid current, while the tubes used in the 
experiment may happen to be from a lot which is 
exceptionally good in this respect. It should be 
borne in mind that the predictions made here are 
merely for the best possible behavior to be ex- 
pected under the specifications over a long period 
of time. Extraneous factors not included in the 
considerations can always arise to invalidate the 




predictions. However, a quantitative prediction 
of performance as a guide makes possible the 
detection of these extraneous factors when they 
do occur. 

In general, the method described in this paper 
is a mathematical extension of the basic ideas 
illustrated in Fig. 1. However, when more than 
one part characteristic is considered, it is no 
longer satisfactory to use the concept of minimum 
and maximum limits. Instead, it becomes neces- 
sary to resort to the distribution concept.⁴


II. METHOD


Description 


This paper describes a method for predicting the values of the mean, μ, and the variance, σ², of the distribution of a circuit characteristic, Y, by using regression analysis to determine empirically a relationship between Y and the specification part characteristics, X_1 ... X_k, of the parts used in the circuit. The determining relationship and the distribution of the specification part characteristics as assumed from knowledge of the specification are used to calculate predicted values for μ and σ². The general method is to assume that the equation which relates the circuit characteristic to the specification part characteristics can be expressed as a linear combination of known functions, f_i(X_1, X_2, ..., X_k), of the specification part characteristics⁵ (X_1, X_2, ..., X_k), plus a random variable, ε:

$$Y = b_0 + b_1 f_1(X_1, X_2, \ldots, X_k) + b_2 f_2(X_1, X_2, \ldots, X_k) + \cdots + b_p f_p(X_1, X_2, \ldots, X_k) + \epsilon. \qquad (1)$$


It is assumed that the functions f_i can be so chosen that practically all of the variation in Y attributable to the specification part characteristics X_1 ... X_k will be contained in the linear combination. The random variable ε will then represent the variation in Y caused by variation in operating part characteristics, experimental error, and other sources not controlled by the values of the specification part characteristics considered. It is arbitrarily assumed that ε is normally and independently distributed, with mean zero and an unknown variance, σ_ε².

⁴R. C. Miles, "Tolerance considerations in electronic product design," Electronic Design, vol. 1, pp. 6-7, May, 1953; vol. 1, pp. 6-7, June, 1953.

⁵Other variables (e.g., part characteristics not controlled by the specifications, applied voltages, and even ambient temperature) can also be included as arguments of the functions. However, the performance prediction will be improved by inclusion of such variables only if information is available concerning their probability distributions.

If a number of models of the circuit are constructed and the values of X_1 ... X_k, together with the corresponding value of Y, are measured on each model, it is possible to estimate, by means of regression analysis, the values of the unknown constants b_0 ... b_p in (1) and the unknown variance σ_ε². The values of the constants and the variance can then be used to estimate the values of μ and σ². Expressions for μ and σ² in terms of b_0 ... b_p and σ_ε² can be obtained from (1) by using the properties of the expected value⁶ and following the procedures described below. An illustration or two should clarify this.
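For the regression step, a small pure-Python sketch follows: it forms the normal equations for (1) with the functions f_i already evaluated, solves them by Gaussian elimination, and estimates σ_ε² from the residuals with n − p − 1 degrees of freedom. The data are synthetic and the names are ours; with many variables or nearly collinear part characteristics, a QR- or SVD-based least-squares routine would be the safer choice.

```python
import random
from typing import List, Sequence, Tuple


def solve(a: List[List[float]], rhs: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting for a small square system."""
    m = len(rhs)
    aug = [row[:] + [v] for row, v in zip(a, rhs)]
    for col in range(m):
        piv = max(range(col, m), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(col + 1, m):
            f = aug[i][col] / aug[col][col]
            for j in range(col, m + 1):
                aug[i][j] -= f * aug[col][j]
    x = [0.0] * m
    for i in reversed(range(m)):
        x[i] = (aug[i][m] - sum(aug[i][j] * x[j] for j in range(i + 1, m))) / aug[i][i]
    return x


def fit_regression(rows: Sequence[Sequence[float]], y: Sequence[float]) -> Tuple[List[float], float]:
    """Least-squares fit of y = b0 + b1*f1 + ... + bp*fp + eps, as in eq. (1).

    Each row holds the already evaluated f_1..f_p for one breadboard model.
    Returns ([b0, ..., bp], s2), where s2 estimates sigma_eps**2 on n - p - 1 degrees of freedom.
    """
    x = [[1.0, *row] for row in rows]  # leading 1.0 carries the constant b0
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    b = solve(xtx, xty)  # normal equations (X^T X) b = X^T y
    resid = [yi - sum(bi * ri for bi, ri in zip(b, r)) for r, yi in zip(x, y)]
    return b, sum(e * e for e in resid) / (len(y) - k)


# Synthetic data (illustration only): y = 1 + 2*f1 - 3*f2 + noise with sigma 0.1.
rng = random.Random(0)
rows = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
y = [1.0 + 2.0 * f1 - 3.0 * f2 + rng.gauss(0.0, 0.1) for f1, f2 in rows]
b, s2 = fit_regression(rows, y)
print([round(v, 2) for v in b], round(s2, 3))  # close to [1.0, 2.0, -3.0] and 0.01
```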

Linear expansion: First, assume that Y can be expanded in terms of the specification part characteristics in a Taylor series with second-order and higher terms neglected. Eq. (1) would then become

$$Y = b_0 + b_1 X_1 + \cdots + b_k X_k + \epsilon.$$

Taking the expected value of Y,

$$E(Y) = b_0 + b_1 E(X_1) + \cdots + b_k E(X_k) + E(\epsilon) = b_0 + b_1\mu_1 + \cdots + b_k\mu_k, \qquad (2)$$

where μ_i is the mean of X_i. The expected or mean value of ε is assumed to be zero. Also,

$$\begin{aligned}
\sigma^2 = E(Y - \mu)^2 &= E\left[b_1(X_1 - \mu_1) + \cdots + b_k(X_k - \mu_k) + \epsilon\right]^2 \\
&= b_1^2 E(X_1 - \mu_1)^2 + \cdots + b_k^2 E(X_k - \mu_k)^2 + E(\epsilon^2) \\
&\quad + 2b_1 b_2 E[(X_1 - \mu_1)(X_2 - \mu_2)] + \cdots + 2b_1 E[(X_1 - \mu_1)\epsilon] + \cdots \\
&= b_1^2\sigma_1^2 + \cdots + b_k^2\sigma_k^2 + \sigma_\epsilon^2 + 2b_1 b_2\sigma_{12} + \cdots, \qquad (3)
\end{aligned}$$

where σ_i² is the variance of X_i, σ_ε² is the variance of ε, and σ_ij is the covariance between X_i and X_j. Since it has been assumed that ε is distributed independently of X_i, the covariance between ε and X_i is zero.


⁶A discussion of expected values can be found in most books on mathematical statistics. The expected value of a function f(X_1, ..., X_k) is defined as an integral,

$$E(f) = \int \cdots \int f\, p(X_1, X_2, \ldots, X_k)\, dX_1\, dX_2 \cdots dX_k.$$


From this definition, it is easy to show that the expected 
value of a sum equals the sum of the expected values; that 
the expected value of the product of a constant and a vari- 
able equals the product of the constant and the expected 
value of the variable; and that the expected value of the 
product of independent variables equals the product of the 
expected values. 




As was stated above, numerical values for the b's can be obtained experimentally by use of regression analysis, as can an estimate for σ_ε². Numerical values for μ_i and σ_i can be obtained from the specifications.⁷ The numerical values for the covariance terms σ_ij will probably have to be estimated on the basis of measurements on the parts available.⁸ By substituting these numerical values into (2) and (3), the numerical values sought for the mean and variance of the circuit characteristic Y are obtained.
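Once the b's, σ_ε², the specification means and standard deviations, and a correlation estimate are in hand, (2) and (3) reduce to a few sums. The Python sketch below follows the covariance estimate given in a footnote to this section, taking σ_ij as the sample correlation coefficient of parts on hand times σ_iσ_j; the function names and the numbers in the example are illustrative.

```python
import math
from typing import Sequence, Tuple


def sample_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample correlation coefficient of two characteristics measured on the same parts."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def predict_mean_variance(b: Sequence[float], mu: Sequence[float], sigma: Sequence[float],
                          corr: Sequence[Sequence[float]], var_eps: float) -> Tuple[float, float]:
    """Eqs. (2) and (3) for Y = b0 + b1*X1 + ... + bk*Xk + eps.

    mu, sigma: means and standard deviations of X1..Xk (from the specifications);
    corr: correlation matrix from parts on hand, so sigma_ij = corr[i][j] * sigma_i * sigma_j.
    """
    w = b[1:]
    k = len(w)
    mean = b[0] + sum(w[i] * mu[i] for i in range(k))
    # The double sum gives b_i^2 sigma_i^2 (i == j) plus 2 b_i b_j sigma_ij (i < j).
    var = sum(w[i] * w[j] * corr[i][j] * sigma[i] * sigma[j]
              for i in range(k) for j in range(k)) + var_eps
    return mean, var


print(predict_mean_variance([1.0, 2.0, 3.0], [1.0, 1.0], [1.0, 2.0],
                            [[1.0, 0.5], [0.5, 1.0]], 0.5))  # (6.0, 52.5)
```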
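Nonlinear expansion: To illustrate the possibility of using nonlinear terms, assume that Y is dependent on only two specification part characteristics, and can be written, say, as

$$Y = b_0 + b_1 X_1 + b_2 X_2^2 + \epsilon.$$

Then,

$$\mu = E(Y) = b_0 + b_1 E(X_1) + b_2 E(X_2^2) + E(\epsilon) = b_0 + b_1\mu_1 + b_2(\mu_2^2 + \sigma_2^2),$$

$$\sigma^2 = E[(Y - \mu)^2] = E(Y^2) - 2\mu E(Y) + \mu^2 = E(Y^2) - \mu^2,$$

and, since the cross-products involving ε vanish (ε has mean zero and is independent of X_1 and X_2),

$$\begin{aligned}
\sigma^2 &= E(b_0 + b_1 X_1 + b_2 X_2^2 + \epsilon)^2 - \mu^2 \\
&= b_0^2 + b_1^2 E(X_1^2) + b_2^2 E(X_2^4) + E(\epsilon^2) + 2b_0 b_1 E(X_1) + 2b_0 b_2 E(X_2^2) + 2b_1 b_2 E(X_1 X_2^2) - \mu^2 \\
&= b_0^2 + b_1^2(\mu_1^2 + \sigma_1^2) + b_2^2 E(X_2^4) + \sigma_\epsilon^2 + 2b_0 b_1\mu_1 + 2b_0 b_2(\mu_2^2 + \sigma_2^2) + 2b_1 b_2 E(X_1 X_2^2) - \mu^2.
\end{aligned}$$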
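The nonlinear example can be checked numerically. The Python sketch below evaluates μ and σ² term by term as in the expansion above and compares them with a Monte Carlo simulation. It adds two assumptions not made in the text: X_1 and X_2 are independent (so E(X_1X_2²) = μ_1(μ_2² + σ_2²)) and normally distributed (so E(X_2⁴) takes the normal form given in a footnote to this section). Parameter values are illustrative and the function names are ours.

```python
import random
from typing import Tuple


def nonlinear_mean_variance(b0: float, b1: float, b2: float, mu1: float, s1: float,
                            mu2: float, s2: float, var_eps: float) -> Tuple[float, float]:
    """Mean and variance of Y = b0 + b1*X1 + b2*X2**2 + eps, term by term as above.

    Assumes X1 and X2 independent and normal, and eps independent of both.
    """
    ex2_sq = mu2**2 + s2**2                          # E(X2^2)
    ex2_4 = 3 * s2**4 + 6 * mu2**2 * s2**2 + mu2**4  # E(X2^4), normal case
    ex1x2_sq = mu1 * ex2_sq                          # E(X1 X2^2), independence
    mean = b0 + b1 * mu1 + b2 * ex2_sq
    ey2 = (b0**2 + b1**2 * (mu1**2 + s1**2) + b2**2 * ex2_4 + var_eps
           + 2 * b0 * b1 * mu1 + 2 * b0 * b2 * ex2_sq + 2 * b1 * b2 * ex1x2_sq)
    return mean, ey2 - mean**2                       # sigma^2 = E(Y^2) - mu^2


def simulate_nonlinear(params: Tuple[float, ...], n: int = 200_000, seed: int = 2) -> Tuple[float, float]:
    """Monte Carlo estimate of the same mean and variance."""
    b0, b1, b2, mu1, s1, mu2, s2, var_eps = params
    rng = random.Random(seed)
    ys = [b0 + b1 * rng.gauss(mu1, s1) + b2 * rng.gauss(mu2, s2) ** 2 + rng.gauss(0.0, var_eps**0.5)
          for _ in range(n)]
    m = sum(ys) / n
    return m, sum((y - m) ** 2 for y in ys) / (n - 1)


# Illustrative values: b0, b1, b2, mu1, s1, mu2, s2, var_eps.
params = (1.0, 2.0, 0.5, 3.0, 0.2, 2.0, 0.1, 0.01)
print(nonlinear_mean_variance(*params))  # about (9.005, 0.21005)
print(simulate_nonlinear(params))        # close to the same pair
```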


In this case, numerical values for the fourth moment of X_2 and for the product moment⁹ E(X_1 X_2²) may have to be estimated from measurements of parts on hand.

From this example, it should be clear that the method used is by no means limited to an expansion involving first powers of the specification part characteristics. The only penalty for using higher powers or products of powers is the occurrence of higher-order moments in the expressions for μ and σ², and the consequent necessity for estimating values for these moments from the specifications or, as a last resort, from measurements on the parts used.

⁷Often, of course, the specifications do not contain information about μ_i and σ_i, but merely give tolerance limits on the value of the specification part characteristic. Even in these cases, however, the realistic approach would seem to be to estimate these values, using the limits as a guide.

⁸σ_ij can be estimated from the sample by using the formula

$$\sigma_{ij} = \frac{\sum (X_i - \bar{X}_i)(X_j - \bar{X}_j)}{\left[\sum (X_i - \bar{X}_i)^2 \sum (X_j - \bar{X}_j)^2\right]^{1/2}}\, \sigma_i \sigma_j.$$

This formula was derived by assuming that the correlation coefficient between characteristics of the parts available in the laboratory is the same as that between characteristics of the parts in the population.

⁹These moments can be expressed in terms of moments about the mean by using the identity

$$X^n = [(X - \mu) + \mu]^n = (X - \mu)^n + n(X - \mu)^{n-1}\mu + \frac{n(n-1)}{2!}(X - \mu)^{n-2}\mu^2 + \cdots + \mu^n$$

and taking expected values. In the case of a normal distribution, the fourth moment about zero reduces to

$$E(X^4) = 3\sigma^4 + 6\mu^2\sigma^2 + \mu^4,$$

and this expression can be used to estimate the fourth moment from the specification.


III. APPLICATION OF THE METHOD


The method described in the preceding section has been applied to a voltage-controlled oscillator. A circuit diagram of the oscillator is shown in Fig. 3.

[Fig. 3 circuit diagram: multivibrator, dc amplifier, and cathode follower stages; +200 V regulated supply.]
V1, V2 = 5719
V3, V4, V5 = 5718
R1, R2 = 3.3K ± 10%
R3 = 47K ± 10%
R4, R5 = 680K ± 10%
R6, R7 = 39K ± 10%
C1, C2 = 330 μμf ± 5%
Fig. 3—Voltage controlled oscillator (grounded input).


Definition of the Problem


To apply the multiple-regression technique, it is first necessary to specify the circuit and define the performance characteristic of interest. The circuit was chosen to be that shown in Fig. 3, under the assumption of zero input and stable power supply. To provide a simple illustration of the method, the circuit performance characteristic selected was the nominal output frequency under the conditions given above. The problem was to predict the distribution of this output frequency for the population (production run) of oscillators.

An important practical consideration in the 
application of the method to the oscillator circuit 
is the number of sets of observations necessary. 
A rough rule of thumb requires that the number 
of sets of observations should exceed the number 
of variables used in the regression equation by 30 
or more. Increasing the number of observations 
increases the accuracy with which the regression 
coefficients are determined. If the rule of thumb 
is used, it follows that each additional part char- 
acteristic or variable included in the regression 
equation necessitates at least one additional set 
of observations. In turn, each new set of obser- 
vations necessitates additional measurements on 
all of the part characteristics used in the regres- 
sion equation. In other words, the total number 
of measurements required tends to increase 
rapidly with the number of variables used in the 
regression equation. For this reason, the vari- 
ables to be used should be carefully selected; 
otherwise, the amount of laboratory work in- 
volved may become prohibitive. Since the com- 
putational work also increases rapidly with the 
number of variables used, an automatic computer 
is practically a necessity if there are more than 
four or five variables. However, if only two or 
three variables are used, this work can easily be 
done on a desk calculator. 
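The rule of thumb above can be put into numbers to show how quickly the laboratory load grows. This is a minimal sketch, assuming one reading per regression variable plus one circuit-output reading per set of observations; a variable that is itself a difference of two measured characteristics (as in the dc amplifier) would need more readings than counted here. The function names are hypothetical.

```python
def observations_needed(num_variables: int, margin: int = 30) -> int:
    """Sets of observations under the rule of thumb: regression variables plus about 30."""
    return num_variables + margin


def measurements_needed(num_variables: int, margin: int = 30) -> int:
    """Total readings, assuming each set records every variable plus one circuit output."""
    return observations_needed(num_variables, margin) * (num_variables + 1)


for k in (2, 5, 10):
    print(k, observations_needed(k), measurements_needed(k))
```

Because both the number of sets and the readings per set grow with the number of variables, the total grows roughly quadratically, which is the practical argument for selecting variables carefully.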


Circuit Analysis 


The circuit must be analyzed to determine 
what parts and what performance characteristics 
of these parts influence the circuit performance 
characteristic of interest. In addition to the cir- 
cuit elements, external factors, such as supply- 
voltage variation, can also be considered. The 
analysis has two purposes: 1) to determine the 
form of the regression equation, and 2) to reduce 
the number of variables appearing in it. The 
number of variables can be reduced either by 
eliminating those factors which obviously have no 
bearing on the circuit characteristic, or by di- 
viding the circuit into smaller sub-circuits, or 
stages, which can be considered individually. 

The latter device accomplishes the objective of 
reducing the number of variables in the regres- 
sion equations without sacrificing a variable 
which might influence the circuit characteristic. 
As experience with this technique shows, the 


MARINI, WILLIAMS: EVALUATION OF CIRCUIT PERFORMANCE 45 


solution of two or more small regression equa- 
tions involves less effort than the solution of one 
large equation with many variables. Another ad- 
vantage of the breakdown of the circuit into its 
individual stages is that this facilitates determi- 
nation of the variables that influence circuit per- 
formance. In many circuits, the stages follow 
one another in a series-type arrangement. When
this is the situation, the variables in any given 
stage can be combined into one variable (the out- 
put of the stage) for inclusion in the next suc- 
ceeding stage. 

The oscillator circuit shown in Fig. 3 could 
have been treated as an entity, with simultaneous 
consideration given to all of the parts in the three 
stages. However, since this would have involved 
a regression equation with a large number of 
variables, it was considered more practical to 
analyze the circuit stage-by-stage. Circuit anal- 
ysis resulted in dividing the oscillator unit into 
three stages—dc amplifier, cathode follower, and 
multivibrator—as indicated by the dotted lines. 
The dc amplifier may be considered as an independent circuit with an output performance characteristic, $E_1$. When the cathode follower is considered independently, $E_1$ must be considered
a variable influencing the cathode-follower output, 
E2. Similarly, when the multivibrator is treated, 
E2 will be included as a variable influencing the 
multivibrator output frequency, f, which is, in 
effect, the output frequency of the oscillator. 
Through this procedure, all of the variables in- 
fluencing the oscillator output frequency will be 
considered in the multivibrator regression equa- 
tion, and the three regression equations that must 
be solved will all be much simpler than the com- 
bined equation would have been. 

Even if this procedure is followed, it is still 
desirable, if at all possible, to further reduce the 
number of variables. In the case of the dc ampli- 
fier (which will be used hereafter for illustrative 
purposes), it is obvious that, under no-signal 
conditions, variability in $E_1$ would be due to the resistive unbalance on either side of the $E_1$ pick-off point. Since a stable voltage supply has been assumed, the circuit parts contributing to the unbalance would be the load resistors and the type 5719 tubes. The specification part characteristics which must be related to $E_1$ are believed to be the dc resistance of the load resistors and the specification plate current ($I_b$) of the
type 5719 tubes. Since it is the unbalance, or dif- 
ference, between the two resistors and tubes in 
the circuit that is of importance, it is logical to 
utilize the values of unbalance in determining the 
relationship. Consequently, the form of the 


46 IRE TRANSACTIONS ON RELIABILITY AND CONTROL 


regression equation chosen for the dc amplifier 
was 


$$E_1 = b_0 + b_1(I_{b_1} - I_{b_2}) + b_2(R_1 - R_2) + \varepsilon \qquad (4)$$

where $\varepsilon$ is the error term added to take care of experimental and other variations in the data.


Experimental Solution 


The problem now is to obtain data to use in 
solving the equation and verifying its validity. 
This is accomplished through a designed labora- 
tory experiment. 

Considerations involved in the design of an ex- 
periment are, of course, the size of the experi- 
ment, the accuracy of the test equipment and the 
methods used to obtain the data. The size of the 
experiment depends upon the degree of precision 
desired in the predicted results of the analysis 
and, consequently, in the estimate of the equation. 
Unfortunately, the precision in this case depends 
to a large extent on the value of the variance of 
the error term—and this value is not determined 
until the experiment has been performed and the 
data have been analyzed. In the experiment de- 
scribed here, 31 complete sets of observations 
were obtained and used in the analysis. 

The accuracy of the test equipment will ap- 
preciably affect the precision with which the esti- 
mate of the equation can be made. The important 
consideration is the need to avoid a consistent 
bias in the test equipment. It is better to have 
equipment of relatively low accuracy, if the vari- 
ations about the true value are randomly distrib- 
uted, than to have equipment which provides 
better accuracy, but is consistently biased high or 
low. Consistent bias will show up in the equation 
and throw the prediction or the estimated equa- 
tion off by a proportional amount. 

To obtain the required data, it is necessary to 
construct a number of breadboards—i.e., the cir- 
cuit-output performance characteristics of inter- 
est must be measured in what is essentially a 
different circuit each time. Obviously, the eco- 
nomical way to do this is to make a single bread- 
board mockup of the circuit, and change the 
circuit elements under consideration each time. 
This device can easily be used in the case of 
low-frequency applications; but, when higher fre- 
quencies are involved, proper consideration must 
be given to the design of the breadboard or it 
may be necessary to construct a large number of 
breadboards. 

Whenever possible, it is desirable to select 
parts that are representative of the parts that 
would be used in production. For example, if the 




tubes in the circuit are of types produced by sev- 
eral manufacturers, it is desirable to have a 
sample of tubes from each manufacturer who 
might provide tubes for the production run. Sim- 
ilarly, the parts used in a breadboard should be 
chosen at random from the total supply of parts 
available, so that they will be reasonably repre- 
sentative of the whole supply of parts available. 

In the study of the oscillator circuit, one 
breadboard model was constructed with spring 
clamps used to hold the resistors and capacitors 
in the breadboard, and standard tube sockets with 
subminiature adaptors used for the tubes. The 
parts and tubes chosen for the experiment were 
selected at random from those available within 
the laboratory, and each part was numbered indi- 
vidually. The parts were then tested for the 
characteristics considered important on the basis 
of their applicable procurement specifications. In 
the case of the dc amplifier, the type 5719 tubes 
were measured as specified in MIL-E-1. Sets of 
parts were then inserted into the complete bread- 
board (which used regulated power supplies to 
insure constant supply voltages), and the output 
performance characteristic of interest (output 
frequency) was recorded. It should be noted that 
this experiment was performed on the complete 
oscillator, so that a particular set of parts 
throughout each stage could be related to the observations in any other stage. The oscillator frequency, $f$, and the outputs, $E_1$ and $E_2$, respectively, of the first two stages were recorded
for each set of parts. A sample of the data ob- 
tained on the dc amplifier during the experimen- 
tal portion of the method is given in Table II. 


TABLE II 


EXPERIMENTAL DATA ON DC AMPLIFIER 


$I_{b_1} - I_{b_2}$ = difference in plate current, in milliamperes.

$R_1 - R_2$ = difference in resistances, in ohms.

$E_1$ = output, in volts.




Results of Regression Analysis 


Regression analysis¹⁰ of the experimental data on the dc amplifier portion of the oscillator effected a solution of (4) for the partial regression coefficients which yielded

$$\begin{aligned} E_1 &= \hat{b}_0 + \hat{b}_1(I_{b_1} - I_{b_2}) + \hat{b}_2(R_1 - R_2) + \varepsilon \\ &= 100.17 + (-56.04)(I_{b_1} - I_{b_2}) + (0.016)(R_1 - R_2) + \varepsilon \end{aligned} \qquad (5)$$

where current was expressed in milliamperes and resistance in ohms. The estimate of the variance of the error term was $\hat{\sigma}_\varepsilon^2 = 2.055$ volts squared.

Before use was made of (5), it was tested for significance, that is, to determine whether or not the equation adequately represented the true condition, and whether or not the individual terms in the equation contributed significantly toward explaining the variability of $E_1$. The test of significance for the total regression equation was set up as an analysis of variance, and the regression proved highly significant. However, just because the regression equation satisfactorily explained the variability of $E_1$, it would not necessarily follow that each term in the equation contributed significantly. It was therefore necessary to make tests of significance on the individual $b$ values. This was conveniently accomplished by means of the $t$ test, which indicated that both $\hat{b}_1$ and $\hat{b}_2$ were highly significant. (In this instance, the $t$ values were $t_1 = 15.6$ and $t_2 = 6.8$, both of which are much greater than the critical value, which is $t = 2.048$ for 28 degrees of freedom, that is, 31 observations less 3 estimated coefficients, and a 0.05 level of significance.) Consequently, it can be accepted that (5) represents a suitable mathematical relationship between the output of the dc amplifier stage and the specification part characteristics chosen.
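The fitting and testing steps above can be sketched without any statistics library. The code below is a minimal sketch, not the authors' procedure: it fits $y = b_0 + \sum_i b_i x_i + \varepsilon$ in deviations from the means (the form used in the Appendix, where $C$ is the inverse of the deviation sums-of-squares-and-products matrix), returns $S_\varepsilon^2$ with $n - k - 1$ degrees of freedom, and forms $t_i = \hat{b}_i/(S_\varepsilon\sqrt{C_{ii}})$. The helper names and the small synthetic data set are hypothetical, not the paper's measurements.

```python
import math


def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def fit_regression(xs, y):
    """Least-squares fit of y = b0 + sum_i b_i * x_i + e, worked in deviations from the means.

    xs is a list of k predictor columns of length n. Returns (b0, b, s2, C, t).
    """
    n, k = len(y), len(xs)
    xbar = [sum(col) / n for col in xs]
    ybar = sum(y) / n
    dev = [[v - xbar[i] for v in col] for i, col in enumerate(xs)]
    ydev = [v - ybar for v in y]
    sscp = [[sum(p * q for p, q in zip(dev[i], dev[j])) for j in range(k)] for i in range(k)]
    sp_y = [sum(p * q for p, q in zip(dev[i], ydev)) for i in range(k)]
    C = inverse(sscp)
    b = [sum(C[i][j] * sp_y[j] for j in range(k)) for i in range(k)]
    b0 = ybar - sum(bi * xb for bi, xb in zip(b, xbar))
    resid = [y[r] - b0 - sum(b[i] * xs[i][r] for i in range(k)) for r in range(n)]
    s2 = sum(e * e for e in resid) / (n - k - 1)       # S_e^2 with n - k - 1 degrees of freedom
    t = [b[i] / math.sqrt(s2 * C[i][i]) for i in range(k)]
    return b0, b, s2, C, t


# Synthetic check (not the paper's data): y = 10 + 2*x1 + 3*x2 plus a residual pattern.
x1 = [-1, -1, 0, 0, 1, 1]
x2 = [-1, 1, -1, 1, -1, 1]
noise = [1, -1, -2, 2, 1, -1]
y = [10 + 2 * a + 3 * c + e for a, c, e in zip(x1, x2, noise)]
b0, b, s2, C, t = fit_regression([x1, x2], y)
print(b0, b, s2, t)
```

With the dc amplifier data, the two columns would hold the 31 observed values of $I_{b_1} - I_{b_2}$ and $R_1 - R_2$, and each $t_i$ would be compared with 2.048 for 28 degrees of freedom.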


Prediction 


Eq. (5) expresses the relationship between the output $E_1$ of the dc amplifier and the specification part-characteristic differences in plate currents ($I_b$) and in resistances ($R$). With knowledge of the distributions of these differences for the populations involved, it is possible to predict the distribution of the output by substituting into the equations


10 R. L. Anderson and T. A. Bancroft, “Statistical Theory in Research,” McGraw-Hill Book Co., Inc., New York, N. Y., pp. 153-190; 1952.

11 Ibid., pp. 191-206.




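$$\mu_{E_1} = \hat{b}_0 + \hat{b}_1\mu_1 + \hat{b}_2\mu_2$$

$$\sigma_{E_1}^2 = \hat{b}_1^2\sigma_1^2 + \hat{b}_2^2\sigma_2^2 + \hat{\sigma}_\varepsilon^2$$

Here, $\mu_1$ and $\mu_2$ are the means of the distributions of $(I_{b_1} - I_{b_2})$ and $(R_1 - R_2)$, respectively. Also, $\sigma_1^2$ and $\sigma_2^2$ are the variances of $(I_{b_1} - I_{b_2})$ and $(R_1 - R_2)$, and $\hat{\sigma}_\varepsilon^2$ is the mean square of the unexplained variation, as obtained from the regression analysis.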
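The two prediction equations can be evaluated directly. This is a minimal sketch with a hypothetical function name; it assumes the regressors are mutually independent, as the variance formula implies (no covariance terms), and it plugs in the published coefficients of (5) and the difference distributions of (6) and (7). With the rounded coefficient 0.016 the resistor term comes to about 6.20 volts squared rather than the 6.07 reported, likely because the printed coefficient is rounded, so the variance lands near 36.45 instead of 36.33.

```python
import math


def predict_output(b0, b, mu, var, s2_error):
    """Predicted mean and variance of a stage output from a fitted linear regression.

    mu and var are the means and variances of the regressors, assumed independent.
    """
    mean = b0 + sum(bi * mi for bi, mi in zip(b, mu))
    variance = sum(bi ** 2 * vi for bi, vi in zip(b, var)) + s2_error
    return mean, variance


# dc amplifier: coefficients of (5); difference distributions as in (6) and (7).
mean, variance = predict_output(
    b0=100.17,
    b=[-56.04, 0.016],
    mu=[0.0, 0.0],
    var=[2 * 0.067 ** 2, 2 * 110.0 ** 2],
    s2_error=2.055,
)
print(round(mean, 2), round(variance, 2), round(math.sqrt(variance), 2))
```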

The distribution of plate current for each tube is not directly obtainable from the applicable MIL-E-1 specification. However, it may be estimated from the specification minimum-maximum limits, employing the method suggested by Miles. Thus, the estimated distribution of $I_b$ may be defined by

$$\mu_{I_b} = 0.7 \text{ mAdc}, \qquad \sigma_{I_b} = 0.067 \text{ mAdc}.$$

Similarly, from the resistor-procurement specifications requiring 3.3K ± 10 per cent resistors, the distribution of $R$ can be estimated as

$$\mu_R = 3300 \text{ ohms}, \qquad \sigma_R = 110 \text{ ohms}.$$

From the above estimates, the distributions of the differences can be obtained.¹²


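$$\mu_1 = \mu_{I_{b_1}} - \mu_{I_{b_2}} = 0.7 - 0.7 = 0 \text{ mAdc}$$

$$\sigma_1^2 = \sigma_{I_{b_1}}^2 + \sigma_{I_{b_2}}^2 = (0.067)^2 + (0.067)^2 \approx 0.00898 \text{ mAdc}^2 \qquad (6)$$

and

$$\mu_2 = \mu_{R_1} - \mu_{R_2} = 3300 - 3300 = 0 \text{ ohms}$$

$$\sigma_2^2 = \sigma_{R_1}^2 + \sigma_{R_2}^2 = (110)^2 + (110)^2 = 24{,}200 \text{ ohms}^2 \qquad (7)$$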
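The difference distributions in (6) and (7) follow from two small rules. The sketch below is illustrative only: the helper `spec_to_normal` is hypothetical and simply treats symmetric specification limits as the ±3σ points, an assumption that reproduces σ_R = 110 ohms from 3.3K ± 10 per cent but is not necessarily the method of Miles cited in the text. The optional covariance argument shows the effect discussed in the footnote on tubes drawn from one lot; the covariance value used there is made up.

```python
def spec_to_normal(nominal, half_width, k=3.0):
    """Hypothetical helper: treat symmetric limits nominal +/- half_width as +/- k sigma points."""
    return nominal, half_width / k


def difference_distribution(mu_a, sigma_a, mu_b, sigma_b, cov=0.0):
    """Mean and variance of A - B; cov = Cov(A, B), zero when parts are drawn independently."""
    return mu_a - mu_b, sigma_a ** 2 + sigma_b ** 2 - 2.0 * cov


mu_r, sigma_r = spec_to_normal(3300.0, 0.10 * 3300.0)          # 3.3K +/- 10 per cent
mu2, var2 = difference_distribution(mu_r, sigma_r, mu_r, sigma_r)
mu1, var1 = difference_distribution(0.7, 0.067, 0.7, 0.067)     # plate-current estimates from the text
print(sigma_r, mu2, var2, mu1, var1)

# Tubes from one lot: a positive covariance (illustrative value) shrinks the variance of the difference.
_, var1_same_lot = difference_distribution(0.7, 0.067, 0.7, 0.067, cov=0.002)
```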


The distribution of the output of the dc ampli- 
fier can now be predicted by 


12 Implicit in (6) is the assumption that $I_{b_1}$ and $I_{b_2}$ are distributed independently. Actually, this is not likely to be true, because the tubes V1 and V2 will probably be drawn from the same lot, rather than be selected at random from the entire population of type 5719 tubes. Since $\operatorname{Var}(I_{b_1} - I_{b_2}) = \sigma_{I_{b_1}}^2 + \sigma_{I_{b_2}}^2 - 2\operatorname{Cov}(I_{b_1}, I_{b_2})$ and the covariance between $I_{b_1}$ and $I_{b_2}$ is positive, the effect of neglecting the covariance term is to overestimate somewhat the size of $\sigma_1^2$. The same considerations apply to (7).




$$\mu_{E_1} = \hat{b}_0 + \hat{b}_1\mu_1 + \hat{b}_2\mu_2 = 100.17 - 56.04(0) + 0.016(0) = 100.17 \text{ volts}$$

$$\begin{aligned} \sigma_{E_1}^2 &= \hat{b}_1^2\sigma_1^2 + \hat{b}_2^2\sigma_2^2 + \hat{\sigma}_\varepsilon^2 \\ &= (-56.04)^2(0.00898) + (0.016)^2(24{,}200) + 2.055 \\ &= 28.20 + 6.07 + 2.06 = 36.33 \text{ volts}^2 \end{aligned}$$

and

$$\sigma_{E_1} = 6.0 \text{ volts}.$$

The numerical values obtained for the regression coefficients, $\hat{b}_i$, and for the variance, $\hat{\sigma}_\varepsilon^2$, of the error term are estimates which differ from the true values by unknown amounts. Before any reliance can be placed on the values of $\mu$ and $\sigma^2$ calculated using these numerical values, some idea of the probable error involved must be available. A method for calculating the probable error is given in the Appendix. For the output of the dc amplifier, the confidence limits calculated for the mean and the standard deviation were ±0.6 per cent and ±11.0 per cent of the predicted values, respectively. If greater accuracy is desired, additional sets of measurements must be taken on the circuit.

Cathode Follower

The cathode-follower stage of the oscillator was treated in the same manner as the dc amplifier, utilizing the output, $E_1$, of the dc amplifier as an independent variable in the regression model

$$E_2 = b_0 + b_1E_1 + b_2I_b + b_3R_3 + \varepsilon.$$

The mean and variance of $E_2$ were estimated by the same procedures that were used in the solution of the dc amplifier. The only difference was the requirement for knowledge of the mean and variance of $E_1$; an estimate of the distribution of $E_1$, however, was available from the solution of the dc amplifier.

The predicted distribution of the output of the dc amplifier, as obtained through the method described in this paper, is presented in Fig. 4, along with the distribution of output obtained from the experimental sample. Comparison of the two distributions shows why population-distribution estimates are necessary. If the output estimates are based on a sample, they may fail to yield an accurate estimate of the population mean. Moreover, in most cases, they will be overoptimistic about the $\sigma^2$ (or tightness) of the distribution. The importance of the best possible population estimation is obvious when the circuit performance characteristics of interest are critical to satisfactory circuit performance.


[Figure: NUMBER OF OBSERVATIONS versus OUTPUT IN VOLTS. Histogram labeled OBSERVED SAMPLE with SAMPLE MEAN and SAMPLE LIMITS; curve labeled PREDICTED DISTRIBUTION with CALCULATED MEAN and POPULATION LIMITS at μ - 2σ and μ + 2σ.]

Fig. 4—Output of the dc amplifier. Histogram: measured output of experimental sample. 
Curve: predicted output of dc amplifier population. 






Multivibrator 


The model of the regression equation assumed for the multivibrator was

$$f = b_0 + b_1E_2 + b_2(I_{b_4} + I_{b_5}) + b_3(I_{b_4,\text{cut-off}} + I_{b_5,\text{cut-off}}) + b_4(R_5C_1 + R_4C_2) + \varepsilon.$$

This form requires some explanation. The symmetry of the circuit led to the use of sums of part characteristics as the approximating functions. The choice of the time constant $(R_5C_1 + R_4C_2)$ as one of the approximating functions was dictated by the theory of operation of the circuit. As it was not known how the other terms should be selected, a linear assumption was made. Originally, the equation also contained terms involving $R_6$, $R_7$, and some additional tube characteristics. Because these terms proved not to be significant, they were dropped from the equation. Analysis of the multivibrator circuit, using the equation given above, yielded a predicted mean of 2180 cps and a predicted standard deviation of 87 cps for the output frequency.


Oscillator 


The problem chosen earlier—to predict the 
distribution of the nominal output frequency of 
the oscillator—is now solved. Under the series 
method of solution used, the predicted output of 
the oscillator is the same as that of the multi- 
vibrator, given above. 

The prediction obtained on the oscillator cir- 
cuit provides an indication of the performance to 
be expected from a large number of these oscil- 
lators. This furnishes the designer a means of 
determining in advance whether his design is 
adequate. The accuracy of the prediction, of 
course, can be completely verified only through 
large-scale production of the oscillators. 


IV. CONCLUSION 


The described method¹³ for predicting initial
circuit performance consists of three distinct 


13 It should be noted that the mathematical apparatus involved in this method of prediction can, in principle, be applied to predict the distribution of any variable that is closely dependent on other variables whose distribution is known or can be predicted. For example, the output performance of an equipment or system could be predicted from a knowledge of the performance of the major




steps: 1) the use of regression analysis to ob- 
tain a relationship between circuit performance 
and part performance, 2) the use of the specifi- 
cations to determine the distribution of part per- 
formance, and 3) mathematical combination of 
the first two steps to make a performance pre- 
diction. 


Value of Regression Techniques 


The application of multiple-regression tech- 
niques to circuit problems is valuable in studying 
the relationships between circuit performance 
and the various factors on which it depends. For 
example, if regression analysis revealed that the 
circuit-performance characteristic under con- 
sideration was not dependent on the specification 
part characteristics, additional part character- 
istics—not listed in the part specifications— 
could be included, in order to determine whether 
circuit performance depended upon them. Under 
the circumstances, inclusion of these additional 
characteristics in the regression equation would 
add nothing to the performance prediction. How- 
ever, if the analysis revealed that the additional 
part characteristics did indeed account for the 
variability in circuit performance, it would be 
logical to conclude either that the variability not 
controlled by the specifications must be accepted, 
or that additional tests are required for parts in- 
tended for use in the circuit in question. In any 
event, a firm knowledge of the factors causing 
variability in a particular circuit would be of 
considerable value. 


Importance of the Specifications 


The method described in this paper relies on 
the specifications for information about the dis- 
tribution of part characteristics. It is realized 
that present-day specifications are frequently 
inadequate for supplying this information. The 
use of the specifications, however, is neverthe- 
less better than the alternative, which is to base 
the estimation on actual measurements on the 
parts. To base a prediction on measurements 
taken on a single lot of parts is never satisfactory; and, while basing the prediction on measurements taken on a large number of parts is
better, it offers no real assurance for the future. 
If more realistic performance predictions are to 
be made, specifications must be changed to set 
forth distribution requirements on part charac- 
teristics. 


subassemblies. An important consideration would be the 
sample size required. 




Value of Performance Prediction 


By means of a circuit-performance predic- 
tion, a designer can find out directly if his de- 
sign will actually fall on the design-center value 
intended, and if the spread of the performance 
characteristic will be too great to tolerate. 
When the design is not satisfactory, the predic- 
tion not only brings the error to light, but also 
provides a ready means for calculating the nu- 
merical values of the changes that must be made 
in the nominal values and tolerances of the parts 
in order to obtain the required output. 
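As an illustration of using the prediction to size a tolerance change, the sketch below asks how small the load-resistor standard deviation must be to hold $\sigma_{E_1}$ at a hypothetical target of 5.8 volts. It reuses the dc amplifier figures from the text (rounded coefficients, independent tubes and resistors); the target values and the function name are assumptions for illustration. It also shows the case where no resistor tolerance suffices because the tube spread and the regression error already use up the whole allowance.

```python
import math


def required_resistor_sigma(target_sigma_e1, b1=-56.04, sigma_ib=0.067, b2=0.016, s2_error=2.055):
    """Largest resistor sigma (ohms) keeping sigma_E1 at or below the target, or None if impossible.

    Uses sigma_E1^2 = b1^2 * 2 sigma_Ib^2 + b2^2 * 2 sigma_R^2 + S_e^2.
    """
    fixed = b1 ** 2 * 2 * sigma_ib ** 2 + s2_error
    room = target_sigma_e1 ** 2 - fixed
    if room <= 0:
        return None   # tube spread and regression error alone already exceed the target
    return math.sqrt(room / (b2 ** 2 * 2))


sigma_r = required_resistor_sigma(5.8)
print(sigma_r, 3 * sigma_r / 3300)   # sigma in ohms; +/-3 sigma as a fraction of 3300 ohms
print(required_resistor_sigma(5.5))
```

Under these assumptions a 5.8-volt target calls for roughly ±7.4 per cent resistors, while 5.5 volts cannot be reached by tightening resistors alone; the tube plate-current spread would have to be reduced.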


APPENDIX: CONFIDENCE LIMITS 


In most cases, it would be desirable to have tolerance limits on the distribution of the circuit performance characteristic, $Y$. If $\mu$ and $\sigma^2$ were known exactly, and if the distribution of $Y$ were assumed to be normal, one could state that 95 per cent of the distribution of $Y$ lies within the limits $\mu \pm 1.96\sigma$. Since $\mu$ and $\sigma^2$ can only be estimated, one would like to be able to find a number $L > 1.96$ such that the estimated limits $\hat{\mu} \pm L\hat{\sigma}$ contain 95 per cent of the distribution of $Y$, 95 per cent of the times that these limits are calculated. This problem has been solved in the case of samples from a normal distribution.¹⁴ The problem here is more difficult, however, for $\hat{\mu}$ and $\hat{\sigma}^2$ are not independently distributed in general, and the distribution of $\hat{\sigma}^2$ is not a simple chi-square. Instead of tolerance limits on the distribution of $Y$, confidence limits on $\mu$ and $\sigma^2$ can be derived. These limits do provide some idea of the probable magnitude of the error in $\hat{\mu}$ and $\hat{\sigma}^2$ due to sampling errors in the calculated values of the $\hat{b}$'s and $S_\varepsilon^2$. In addition to this sampling error, of course, the error due to estimation of the $\mu_i$ and the $\sigma_i$ from the specifications must be considered in any practical problem.


Confidence Limits on $\mu$


Confidence limits on $\mu$ are relatively easy to obtain. The estimate $\hat{\mu}$ is a linear combination of $\hat{b}_0, \ldots, \hat{b}_k$. Thus

$$\hat{\mu} = \hat{b}_0 + \hat{b}_1\mu_1 + \cdots + \hat{b}_k\mu_k.$$


It can be shown¹⁵ from regression theory that


14 A. Wald and J. Wolfowitz, “Tolerance limits for a normal distribution,” Ann. Math. Statistics, vol. 17, pp. 208-215; June, 1946.

15 Hald, op. cit., pp. 638-642.


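the variables $\hat{b}_0, \ldots, \hat{b}_k$ are normally distributed with means $b_0, \ldots, b_k$, and variances and covariances given by

$$V = \begin{bmatrix} \operatorname{Var}(\hat{b}_0) & \operatorname{Cov}(\hat{b}_0, \hat{b}_1) & \cdots & \operatorname{Cov}(\hat{b}_0, \hat{b}_k) \\ \operatorname{Cov}(\hat{b}_1, \hat{b}_0) & \operatorname{Var}(\hat{b}_1) & \cdots & \operatorname{Cov}(\hat{b}_1, \hat{b}_k) \\ \vdots & \vdots & & \vdots \\ \operatorname{Cov}(\hat{b}_k, \hat{b}_0) & \operatorname{Cov}(\hat{b}_k, \hat{b}_1) & \cdots & \operatorname{Var}(\hat{b}_k) \end{bmatrix} = \sigma_\varepsilon^2 \begin{bmatrix} \dfrac{1}{n} + \sum_i \sum_j \bar{X}_i \bar{X}_j C_{ij} & -\sum_i \bar{X}_i C_{i1} & \cdots & -\sum_i \bar{X}_i C_{ik} \\ -\sum_i \bar{X}_i C_{i1} & C_{11} & \cdots & C_{1k} \\ \vdots & \vdots & & \vdots \\ -\sum_i \bar{X}_i C_{ik} & C_{k1} & \cdots & C_{kk} \end{bmatrix}.$$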


The matrix of the quantities $C_{ij}$ is first found by taking the inverse of the matrix of the sums of squares and products of the deviations of the $X_i$ about their means. The elements of the first row and column of the matrix $V$ can then be calculated. It is also possible to calculate $V$ directly, since $V/\sigma_\varepsilon^2$ is equal to the inverse of the matrix of the sums of squares and products of the $X_i$ themselves (with a column $X_0 = 1$ included for the constant term), but the above form is more convenient when the analysis is performed on a desk calculator.
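The two routes to $V$ described above can be checked against each other numerically. The sketch below (pure Python, hypothetical helper names, small made-up predictor columns) builds $V/\sigma_\varepsilon^2$ once from the partitioned form in terms of the $C_{ij}$ and once as the inverse of the raw sums of squares and products with a constant column; the two should agree to rounding error.

```python
def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def v_from_deviations(xs):
    """V / sigma_e^2 from C = inverse of the deviation SSCP matrix (the partitioned form)."""
    n, k = len(xs[0]), len(xs)
    xbar = [sum(col) / n for col in xs]
    dev = [[v - xbar[i] for v in col] for i, col in enumerate(xs)]
    C = inverse([[sum(p * q for p, q in zip(dev[i], dev[j])) for j in range(k)] for i in range(k)])
    top_left = 1.0 / n + sum(xbar[i] * xbar[j] * C[i][j] for i in range(k) for j in range(k))
    border = [-sum(xbar[i] * C[i][j] for i in range(k)) for j in range(k)]
    return [[top_left] + border] + [[border[i]] + C[i] for i in range(k)]


def v_direct(xs):
    """V / sigma_e^2 as the inverse of the raw SSCP matrix of (1, X_1, ..., X_k)."""
    rows = list(zip(*([[1.0] * len(xs[0])] + xs)))
    size = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    return inverse(xtx)


xs = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 6.0]]   # made-up X_1 and X_2 columns
A = v_from_deviations(xs)
B = v_direct(xs)
print(max(abs(A[i][j] - B[i][j]) for i in range(3) for j in range(3)))
```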

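Since $\hat{\mu}$ is a linear combination of normally distributed variables, it follows that $\hat{\mu}$ itself is normally distributed with mean

$$E(\hat{\mu}) = E(\hat{b}_0) + E(\hat{b}_1)\mu_1 + \cdots + E(\hat{b}_k)\mu_k = b_0 + b_1\mu_1 + \cdots + b_k\mu_k = \mu$$

and variance

$$\begin{aligned} E(\hat{\mu} - \mu)^2 &= E\left[(\hat{b}_0 - b_0) + (\hat{b}_1 - b_1)\mu_1 + \cdots + (\hat{b}_k - b_k)\mu_k\right]^2 \\ &= \operatorname{Var}(\hat{b}_0) + \mu_1^2\operatorname{Var}(\hat{b}_1) + \cdots + \mu_k^2\operatorname{Var}(\hat{b}_k) + 2\mu_1\operatorname{Cov}(\hat{b}_0, \hat{b}_1) + \cdots + 2\mu_1\mu_2\operatorname{Cov}(\hat{b}_1, \hat{b}_2) + \cdots \\ &= Q^2\sigma_\varepsilon^2, \end{aligned}$$

where, from the form of $V$, $Q^2 = \dfrac{1}{n} + \sum_i\sum_j (\mu_i - \bar{X}_i)(\mu_j - \bar{X}_j)C_{ij}$.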




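Consequently, $(\hat{\mu} - \mu)/Q\sigma_\varepsilon$ is normally distributed with zero mean and unit variance.

Also, from regression theory, $S_\varepsilon^2 = \sum(Y - \hat{Y})^2/(n - k - 1)$, and $(n - k - 1)S_\varepsilon^2/\sigma_\varepsilon^2$ has a $\chi^2$ distribution with $(n - k - 1)$ degrees of freedom, independently of the $\hat{b}_i$.

Therefore $(\hat{\mu} - \mu)/[Q\sigma_\varepsilon(S_\varepsilon/\sigma_\varepsilon)] = (\hat{\mu} - \mu)/QS_\varepsilon$ has a $t$ distribution with $n - k - 1$ degrees of freedom, and confidence limits are given by

$$\hat{\mu} - tS_\varepsilon Q < \mu < \hat{\mu} + tS_\varepsilon Q. \qquad (8)$$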
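Limits (8) reduce to a quadratic form and one multiplication. This is a minimal sketch with hypothetical function names: `q_value` evaluates $Q^2 = u'(V/\sigma_\varepsilon^2)u$ with $u = (1, \mu_1, \ldots, \mu_k)$, and the $t$ value is passed in rather than computed, since no statistics library is assumed. The usage line plugs in the dc amplifier figures quoted in the text ($t = 2.05$, $S_\varepsilon = 1.434$, $Q = 0.1982$).

```python
import math


def q_value(v_over_s2, mu):
    """Q in (8): Q^2 = u' (V / sigma_e^2) u, where u = (1, mu_1, ..., mu_k)."""
    u = [1.0] + list(mu)
    n = len(u)
    return math.sqrt(sum(u[i] * v_over_s2[i][j] * u[j] for i in range(n) for j in range(n)))


def mean_confidence_limits(mu_hat, t, s_e, q):
    """Limits (8): mu_hat -/+ t * S_e * Q, with t for n - k - 1 degrees of freedom."""
    half_width = t * s_e * q
    return mu_hat - half_width, mu_hat + half_width


low, high = mean_confidence_limits(100.17, 2.05, 1.434, 0.1982)
print(round(low, 2), round(high, 2), round((high - low) / 2, 2))
```

The half-width of about 0.58 volt is roughly 0.6 per cent of the predicted mean.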
Confidence Limits on $\sigma^2$


Exact confidence limits on $\sigma^2$ have not yet been worked out. Since the distribution of $\hat{\sigma}^2$ becomes approximately normal when the number of degrees of freedom is large and when the $t$ values associated with the $\hat{b}_i$ are large, it is possible to obtain approximate confidence limits by deriving the expression for the variance of $\hat{\sigma}^2$. The expression for this variance will contain the population parameters $b_1, \ldots, b_k$ and $\sigma_\varepsilon$. An approximate value for the variance then can be calculated by substituting the estimates of these parameters into the expression, and an expression analogous to (8) will give approximate confidence limits on $\sigma^2$. The derivation is best carried out in matrix notation.


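Let $\hat{B} = [\hat{b}_1, \ldots, \hat{b}_k]$ and $B = [b_1, \ldots, b_k]$, and let

$$S = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1k} \\ \vdots & \vdots & & \vdots \\ \sigma_{k1} & \sigma_{k2} & \cdots & \sigma_k^2 \end{bmatrix}$$

be the covariance matrix of the part characteristics. Then

$$\sigma^2 = BSB' + \sigma_\varepsilon^2, \qquad \hat{\sigma}^2 = \hat{B}S\hat{B}' + S_\varepsilon^2,$$

where a prime is used to designate the transpose of a matrix, and $S_\varepsilon^2$ is the unbiased estimate of $\sigma_\varepsilon^2$. Writing $\hat{B} = (\hat{B} - B) + B$,

$$\begin{aligned} \hat{\sigma}^2 &= [(\hat{B} - B) + B]S[(\hat{B} - B) + B]' + S_\varepsilon^2 \\ &= BSB' + (\hat{B} - B)SB' + BS(\hat{B} - B)' + (\hat{B} - B)S(\hat{B} - B)' + S_\varepsilon^2. \end{aligned}$$

If it is assumed that the $t$ values calculated in the regression analysis are all large, then the range of $\hat{b}_i - b_i$ will be small compared to $b_i$. The fourth term in the equation above can then be neglected and, since $(\hat{B} - B)SB' = BS(\hat{B} - B)'$ for symmetric $S$,

$$\hat{\sigma}^2 \approx BSB' + 2BS(\hat{B} - B)' + S_\varepsilon^2.$$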


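Taking the expected value,

$$E(\hat{\sigma}^2) \approx BSB' + \sigma_\varepsilon^2 = \sigma^2;$$

also,

$$E(\hat{\sigma}^2 - \sigma^2)^2 \approx E\left[2BS(\hat{B} - B)' + (S_\varepsilon^2 - \sigma_\varepsilon^2)\right]^2,$$

then

$$\begin{aligned} E(\hat{\sigma}^2 - \sigma^2)^2 &\approx E\left[4BS(\hat{B} - B)'(\hat{B} - B)SB' + (S_\varepsilon^2 - \sigma_\varepsilon^2)^2 + 4(S_\varepsilon^2 - \sigma_\varepsilon^2)BS(\hat{B} - B)'\right] \\ &= 4BSCSB'\sigma_\varepsilon^2 + 2\sigma_\varepsilon^4/f, \end{aligned}$$

since $E[(\hat{B} - B)'(\hat{B} - B)] = C\sigma_\varepsilon^2$, where $C$ is the matrix of the $C_{ij}$; since the variance¹⁶ of $S_\varepsilon^2$ is $2\sigma_\varepsilon^4/f$, where $f = n - k - 1$ is the number of degrees of freedom; and since $S_\varepsilon^2$ is distributed independently of the $\hat{b}_i$, so that the cross term has zero expectation. Because the values of $B$ and $\sigma_\varepsilon^2$ in the above expression are not known, these quantities will be replaced by their estimates. The error introduced will not be serious if the $t$ values and degrees of freedom are large:

$$E(\hat{\sigma}^2 - \sigma^2)^2 \approx \left[4\hat{B}SCS\hat{B}' + 2S_\varepsilon^2/f\right]S_\varepsilon^2 = P^2S_\varepsilon^2.$$

Approximate confidence limits on $\sigma^2$ are therefore

$$\hat{\sigma}^2 - tS_\varepsilon P < \sigma^2 < \hat{\sigma}^2 + tS_\varepsilon P. \qquad (9)$$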
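The matrix form of (9) can be evaluated with a few small matrix products. This sketch uses hypothetical function names and pure Python; `S` is the covariance matrix of the part characteristics, `C` the inverse deviation SSCP matrix from the regression, and `f` the error degrees of freedom. The test recomputes the half-width from the equivalent $t$-value form derived below for the dc amplifier, as a check on the algebra for a diagonal `S`.

```python
import math


def matmul(a, b):
    """Product of two small matrices given as lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def approx_variance_limits(b, S, C, s2_e, f, t):
    """Approximate limits (9) on sigma^2: returns (sigma_hat^2, t * S_e * P).

    b: estimated coefficients [b_1, ..., b_k] (b_0 is not needed).
    """
    B = [list(b)]                      # 1 x k row vector
    Bt = [[v] for v in b]              # k x 1 column vector
    var_hat = matmul(matmul(B, S), Bt)[0][0] + s2_e
    bscsb = matmul(matmul(matmul(matmul(B, S), C), S), Bt)[0][0]
    p = math.sqrt(4.0 * bscsb + 2.0 * s2_e / f)
    return var_hat, t * math.sqrt(s2_e) * p


# Made-up two-variable example.
b = [2.0, 3.0]
S = [[0.5, 0.0], [0.0, 0.2]]
C = [[0.04, 0.01], [0.01, 0.09]]
var_hat, half_width = approx_variance_limits(b, S, C, s2_e=1.5, f=20, t=2.0)
print(var_hat - half_width, var_hat + half_width)
```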


Application to the dc Amplifier 


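The estimated mean at the output of the dc amplifier is

$$\hat{\mu}_{E_1} = \hat{b}_0 + \hat{b}_1\mu_1 + \hat{b}_2\mu_2 = \hat{b}_0,$$

since $\mu_1 = \mu_2 = 0$. $Q$ in (8) reduces to

$$Q = \sqrt{\frac{1}{n} + \bar{X}_1^2C_{11} + 2\bar{X}_1\bar{X}_2C_{12} + \bar{X}_2^2C_{22}}.$$

The value of $t$ for 28 degrees of freedom at the 0.05 level of significance is 2.05. Ninety-five per cent confidence limits on the mean are, therefore,

$$\begin{aligned} \mu_{E_1} &= \hat{\mu}_{E_1} \pm tS_\varepsilon Q \\ &= 100.17 \pm (2.05)(1.434)(0.1982) \\ &= 100.17 \pm 0.58 \text{ volts}. \end{aligned}$$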


From the symmetry of the dc amplifier circuit, it is evident that the true value of $\mu_{E_1}$ is one-half of B+ (the 200-volt regulated supply), or 100 volts. The agreement is good.

The estimated variance at the output of the dc 
amplifier is 


16 Ibid., pp. 276-278.


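$$\hat{\sigma}_{E_1}^2 = \hat{b}_1^2\sigma_1^2 + \hat{b}_2^2\sigma_2^2 + S_\varepsilon^2. \qquad (10)$$

The expression

$$P^2 = 4\hat{B}SCS\hat{B}' + 2S_\varepsilon^2/f$$

in (9) becomes

$$\begin{aligned} P^2 &= 4\begin{bmatrix}\hat{b}_1 & \hat{b}_2\end{bmatrix}\begin{bmatrix}\sigma_1^2 & 0\\ 0 & \sigma_2^2\end{bmatrix}\begin{bmatrix}C_{11} & C_{12}\\ C_{21} & C_{22}\end{bmatrix}\begin{bmatrix}\sigma_1^2 & 0\\ 0 & \sigma_2^2\end{bmatrix}\begin{bmatrix}\hat{b}_1\\ \hat{b}_2\end{bmatrix} + 2S_\varepsilon^2/f \\ &= 4\left(\hat{b}_1^2\sigma_1^4C_{11} + \hat{b}_2^2\sigma_2^4C_{22} + 2\hat{b}_1\hat{b}_2C_{12}\sigma_1^2\sigma_2^2\right) + 2S_\varepsilon^2/f. \end{aligned}$$

Making use of the equations $t_{b_1} = \hat{b}_1/(S_\varepsilon\sqrt{C_{11}})$ and $t_{b_2} = \hat{b}_2/(S_\varepsilon\sqrt{C_{22}})$, $P^2$ becomes

$$P^2 = \frac{4\hat{b}_1^4\sigma_1^4}{S_\varepsilon^2t_{b_1}^2} + \frac{4\hat{b}_2^4\sigma_2^4}{S_\varepsilon^2t_{b_2}^2} + \frac{8\hat{b}_1^2\hat{b}_2^2\sigma_1^2\sigma_2^2}{S_\varepsilon^2t_{b_1}t_{b_2}}\cdot\frac{C_{12}}{\sqrt{C_{11}C_{22}}} + \frac{2S_\varepsilon^2}{f}.$$

Substituting for $P$ into (9), the confidence limits on $\sigma^2$ can be written as

$$\sigma^2 = \hat{\sigma}^2 \pm \left[\left(\hat{b}_1^2\sigma_1^2\frac{2t}{t_{b_1}}\right)^2 + \left(\hat{b}_2^2\sigma_2^2\frac{2t}{t_{b_2}}\right)^2 + 2\left(\hat{b}_1^2\sigma_1^2\frac{2t}{t_{b_1}}\right)\left(\hat{b}_2^2\sigma_2^2\frac{2t}{t_{b_2}}\right)\frac{C_{12}}{\sqrt{C_{11}C_{22}}} + \frac{2S_\varepsilon^4t^2}{f}\right]^{1/2}.$$

In this form, it is possible to interpret the significance of the various terms in the brackets. The quantity $t/t_{b_1}$ found in the first term is, roughly speaking, the fractional error in $\hat{b}_1$. Doubling this error gives approximately the fractional error in $\hat{b}_1$ squared. The first term, consequently, represents the square of the absolute error in the first term of (10). The second term represents the square of the error in the second term of (10). If $\hat{b}_1$ and $\hat{b}_2$ were independently distributed, the errors in the first and second terms of (10) would add as the square root of the sum of the squares. The third term in the equation above, which contains the correlation coefficient $C_{12}/\sqrt{C_{11}C_{22}}$, takes into account any correlation between $\hat{b}_1$ and $\hat{b}_2$, while the last term above represents the error in the estimate $S_\varepsilon^2$ which, being independently distributed, adds as the square root of the sum of the squares.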

Substituting numerical values obtained in the dc amplifier problem, there results

$$\begin{aligned} \sigma^2 &= 36.3 \pm [54.5 + 13.5 - 3.8 + 1.3]^{1/2} \\ &= 36.3 \pm 8.1 = 36.3 \pm 22\%. \end{aligned}$$

Consequently,

$$\sigma_{E_1} = 6.0 \pm 11\% \text{ volts}.$$
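The bracketed numbers can be reproduced from the quantities reported for the dc amplifier. In this sketch the correlation term is entered as the printed value, -3.8, because $C_{12}$ is not reported; the other three terms are computed from $t = 2.05$, $f = 28$, $S_\varepsilon^2 = 2.055$, the two variance contributions of (10), and $t_{b_1} = 15.6$, $t_{b_2} = 6.8$. Small differences from the printed 54.5 and 13.5 are presumably rounding.

```python
import math

t, f = 2.05, 28                  # t value and error degrees of freedom used in the text
s2_e = 2.055                     # S_e^2 in volts squared
part1, part2 = 28.20, 6.07       # the b1^2 sigma1^2 and b2^2 sigma2^2 terms of (10)
t_b1, t_b2 = 15.6, 6.8           # t values of b1 and b2

var_hat = part1 + part2 + s2_e                  # equation (10), about 36.3
err1 = part1 * 2 * t / t_b1                     # absolute error in the first term of (10)
err2 = part2 * 2 * t / t_b2                     # absolute error in the second term of (10)
cross_term = -3.8                               # correlation term as printed; C12 is not reported
noise_term = 2 * s2_e ** 2 * t ** 2 / f         # error contributed by S_e^2
half_width = math.sqrt(err1 ** 2 + err2 ** 2 + cross_term + noise_term)

sigma_hat = math.sqrt(var_hat)
upper = math.sqrt(var_hat + half_width) / sigma_hat - 1
lower = 1 - math.sqrt(var_hat - half_width) / sigma_hat
print(round(err1 ** 2, 1), round(err2 ** 2, 1), round(noise_term, 1))
print(round(half_width, 1), round(half_width / var_hat, 3), round(upper, 3), round(lower, 3))
```

Taking square roots of the variance limits gives about +10.6 and -11.9 per cent on $\sigma_{E_1}$, which the text summarizes as ±11 per cent.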




RELIABILITY USING REDUNDANCY CONCEPTS* 


L. DEPIAN† and N. T. GRISAMORE†


Summary—This paper introduces a new method 
of using redundancy to obtain reliable operation 
of electronic circuits. Switching circuits are 
used as examples to illustrate the method. A 
comparison is made between the majority logic 
method and the averaging method proposed in 
this paper. The comparison shows that the av- 
eraging method should provide a greater circuit 
reliability than the majority method if the com- 
ponents of each circuit have the same reliability. 


INTRODUCTION 


Reliability, the probability of occurrence of a 
desirable result, has always concerned man. As 
soon as he learned to control the outcome of any 
particular function, he started striving for higher 
reliabilities. In modern days, this desire is still 
the same; and as our world has grown more com- 
plex, higher reliabilities have become more vital. 

In some cases an undesirable result or error 
is recognized as such; we know what the result 
should be and any erroneous nature is readily 
recognized. In other cases, however, the outcome is not known. We are seeking information
and are not in a position to know the correctness 
of a particular outcome. Such is, for example, 
the nature of measurements; if the measuring 
instrument is defective, the result will be in 
error and in general the error will be unrecog- 
nized, at least in the immediate sense. 

This is also the nature of the problem concern- 
ing reliable operation of systems composed of 
digital circuits. At present large scale digital 
computers use from 1,000 to 10,000 gate circuits, 
each circuit composed of a number of passive 
linear elements and one or more non-linear de- 
vices such as diodes, transistors, vacuum tubes, 
and/or magnetic cores. If to each of the gates we 
attach a number representing the probability of 
reliable operation of the circuit for some definite 
time interval, then an upper bound of the system 
reliability can be computed. For example, con- 


*This work was supported by the Office of Naval Research, Washington, D. C., under Contract No. N70n41906.

†Elec. Engrg. Dept., The George Washington University, Washington, D. C.


sider the case where the gate circuit is 99.999 
per cent reliable, meaning the gate has a proba- 
bility of 0.99,999 of operating correctly during 
the specified time interval. The reliability of a 
computer composed of 10,000 gates would then be 
only (0.99,999) °° or about 0.905. Even if the 
computer used only one tenth of its gates during 
the specified time interval, the reliability would 
be only about 0.990. 
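The arithmetic in the preceding paragraph can be checked with a short Python sketch. It assumes, as the text does, that gates fail independently and that the computer works only if every gate in use works; the function name is ours, not the authors'.

```python
def series_reliability(gate_reliability: float, gates: int) -> float:
    """Probability that `gates` independent gates all operate correctly."""
    return gate_reliability ** gates


R_GATE = 0.99999  # per-gate reliability used in the text

all_gates = series_reliability(R_GATE, 10_000)   # every one of 10,000 gates in use
tenth_gates = series_reliability(R_GATE, 1_000)  # only one tenth of them in use
print(f"{all_gates:.3f} {tenth_gates:.3f}")      # prints 0.905 0.990
```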

To regard the problem from a different aspect, suppose the circuit in question operates at a frequency of one megacycle, which is a reasonable figure for present-day computers. If the system is composed of 1000 gates, each having a reliability of 0.99999 per operation, then the system would on the average generate errors at the rate of 10,000 per second.
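The error-rate figure follows from three factors multiplied together. The sketch below assumes (our reading of the example) that every gate performs one operation per clock cycle and errs independently with probability 1 − R on each; `expected_errors_per_second` is a hypothetical helper. Exact fractions avoid floating-point noise.

```python
from fractions import Fraction


def expected_errors_per_second(gates: int, ops_per_second: int, reliability: Fraction) -> Fraction:
    """Mean number of erroneous gate outputs per second, assuming each gate
    operates once per cycle and errs independently with probability 1 - reliability."""
    return gates * ops_per_second * (1 - reliability)


# 1000 gates at one megacycle, each 0.99999 reliable per operation
rate = expected_errors_per_second(1000, 1_000_000, Fraction(99999, 100000))
print(rate)  # 10000
```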

Regarding the above figures, one is inclined to 
wonder how complex systems can be made to op- 
erate usefully and economically. In actual sys- 
tems, however, only a few component parts oper- 
ate continuously, thus reducing the probability of 
failure of the whole system. Secondly, in some 
cases, an error produced by one of the gates will 
not be propagated throughout the whole system. 
Regardless of these factors, however, the reliable operation of large-scale systems is still a problem, as evidenced by the amount of effort expended in increasing component reliability, devising error-checking schemes, marginal checking, preventive maintenance, etc.


NATURE OF THE PROBLEM 


In general, computer and logical circuits, consisting of gates, are designed to represent the function

$$A = F(a, b, c, \ldots) \qquad (1)$$

where $A$ is the output, $a, b, c, \ldots$ are the inputs, and $F(a, b, c, \ldots)$ is some particular Boolean function of the inputs. Since this is a Boolean equation, the variables can assume only two possible values corresponding to two possible states of the circuit, which we will call ‘‘on’’ and ‘‘off’’. The problem of reliable operation is one where we wish the circuit shown in Fig. 1 to represent function (1) with a given probability. In general, by the simple process of repetition (redundancy)

54 IRE TRANSACTIONS ON RELIABILITY AND CONTROL 


Fig. 1—Generalized gate circuit represented by (1). (Inputs a, b, c; output A.)


one can increase the reliability of a system. For instance, if there are a number of identical circuits in parallel, a failure of one will not be catastrophic, since the others will still carry on their function. The difficulty here, however, lies in the fact that outputs and inputs may not always be in a state labeled ‘‘on’’ or ‘‘off’’. As an example, suppose an output of 2 ma is labeled as ‘‘off’’ and one of 10 ma as ‘‘on’’. Because of malfunctions, however (defective components, spread and/or drift in characteristics, aging, etc.), the output may be 5 ma, neither ‘‘on’’ nor ‘‘off’’. A dividing line may be drawn, for example, at 5.5 ma, with any output smaller than this value labeled as ‘‘off’’ and any larger output as ‘‘on’’. However, even this scheme is not satisfactory, since there may still be ‘‘off’’ outputs larger than 5.5 ma, giving rise to an error, since they will be considered as ‘‘on’’ by any device sensing the output. In general, a probability density will exist for the ‘‘off’’ and ‘‘on’’ outputs, as shown in Fig. 2. It is assumed that this distribution takes into account any effect caused by a distribution of amplitudes of the input signals. An output to the right of the dividing line will be considered as ‘‘on’’ and one to the left as ‘‘off’’. Clearly, area A gives the probability of error in the ‘‘off’’ state and area B the probability of error in the ‘‘on’’ state. The problem is to decrease these error probabilities.


Fig. 2—Probability density, f(x), for outputs of amplitude x from gate circuits of the type shown in Fig. 1. (Horizontal axis: output amplitude.)
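Areas A and B of Fig. 2 are tail probabilities on either side of the dividing line. The paper does not specify the shape of the densities, so this sketch assumes, purely for illustration, normal ‘‘off’’ and ‘‘on’’ densities centred on the 2 ma and 10 ma levels of the example, with a hypothetical common standard deviation of 1.5 ma.

```python
import math


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Cumulative normal distribution, written with math.erf."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def error_areas(threshold: float, off_mean: float, on_mean: float, sd: float) -> tuple[float, float]:
    """Return (area A, area B) of Fig. 2 for assumed normal "off" and "on" densities."""
    area_a = 1.0 - normal_cdf(threshold, off_mean, sd)  # "off" output read as "on"
    area_b = normal_cdf(threshold, on_mean, sd)         # "on" output read as "off"
    return area_a, area_b


# 2 ma "off", 10 ma "on", dividing line at 5.5 ma (from the text);
# the common spread of 1.5 ma is hypothetical.
area_a, area_b = error_areas(5.5, 2.0, 10.0, 1.5)
print(f"A = {area_a:.4f}, B = {area_b:.4f}")
```

Moving the dividing line trades one area against the other, which is the point made about y in the averaging section.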


APRIL 
THE MAJORITY METHOD 


Recent efforts have been made to apply the principle of redundancy to large-scale systems so as to increase their reliability. These efforts have been based on ideas proposed by von Neumann.¹ The scheme is to use identical redundant circuits, each having the same input signals, and one majority device which senses the outputs of the redundant circuits. The majority device then produces an output which agrees with the majority of the outputs. For example, consider three redundant circuits. If all three outputs fall on one side of the dividing line in Fig. 2, let us say to the left, or two on the left and one on the right, the output will be considered by the majority device as ‘‘off’’. Let us see what improvement has been obtained in this fashion. Consider the ‘‘off’’ probability density $f_1(x)$ of the output $x$, and let $y$ be the value of $x$ at the dividing line. The probability of correct interpretation using one circuit will depend on the position of $y$ and will be $R_1$ (the probability of $x$ less than $y$), shown in Fig. 3:


$$R_1 = \int_{-\infty}^{y} f_1(x)\,dx. \qquad (2)$$

If three circuits are used in the majority scheme, a correct interpretation will be made 1) if all three circuit outputs are less than $y$, giving probability $R_1^3$; or 2) if two of the outputs are less than $y$ and one is greater than $y$, giving probability $3R_1^2(1 - R_1)$.


Fig. 3—Reliability, $R_1$, as a function of the distribution of the output amplitude, $x$. $R_1$ represents the area under the curve between the limits $-\infty$ and $y$.


¹J. von Neumann, ‘‘Probabilistic Logics and Synthesis of Reliable Organisms from Unreliable Components,’’ Automata Studies, Princeton University Press, Princeton, N. J.; 1956.


960 


The new probability of correct interpretation will then be

$$R_{3,m} = R_1^2(3 - 2R_1), \qquad (3)$$

which is an improvement over $R_1$ if $R_1$ is greater than 0.5.


If five circuits are used in a majority circuit, the new probability is found to be

$$R_{5,m} = R_1^3(10 - 15R_1 + 6R_1^2), \qquad (4)$$

and for a majority of seven circuits

$$R_{7,m} = R_1^4(35 - 84R_1 + 70R_1^2 - 20R_1^3). \qquad (5)$$

These functions are shown in Fig. 4.
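Equations (3) to (5) are tails of a binomial distribution: the majority device is right when at least (n + 1)/2 of the n independent circuits are read correctly. A minimal sketch, assuming a perfect majority device and odd n, that evaluates the binomial sum and compares it with the closed forms (helper names are ours):

```python
from math import comb


def majority_reliability(r1: float, n: int) -> float:
    """Probability that a perfect majority device, fed by n independent
    redundant circuits each read correctly with probability r1, is correct."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that a strict majority exists")
    return sum(comb(n, k) * r1**k * (1 - r1) ** (n - k) for k in range((n + 1) // 2, n + 1))


# Closed forms (3), (4) and (5) from the text
def r3m(r1: float) -> float:
    return r1**2 * (3 - 2 * r1)


def r5m(r1: float) -> float:
    return r1**3 * (10 - 15 * r1 + 6 * r1**2)


def r7m(r1: float) -> float:
    return r1**4 * (35 - 84 * r1 + 70 * r1**2 - 20 * r1**3)


for r1 in (0.4, 0.6, 0.9):
    print(r1, [round(majority_reliability(r1, n), 4) for n in (1, 3, 5, 7)])
```

The printed rows show the behaviour of Fig. 4: reliability grows with n when R_1 > 0.5 and shrinks when R_1 < 0.5.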


Fig. 4—Reliability, $R_{n,m}$, of majority method compared with reliability of single circuit, $R_1$.


It is seen that, if the original probability $R_1$ of obtaining a correct interpretation of the output for a single circuit is larger than 0.5, the majority method gives a definite improvement in reliability, assuming, of course, a perfect majority device. Furthermore, this improvement increases as the number of redundant circuits is increased.

The majority method, however, makes no use of the probability density curve of the circuit output. $R_1$ can have the same value for a different $f_1(x)$ and a different $y$. $R_{n,m}$ is derived directly from $R_1$ and in that sense is independent of the form of $f_1(x)$. In other words, the majority method does not make full use of the information associated with $f_1(x)$. It could be asked at this point: is the majority method the best reliability


DEPIAN, GRISAMORE: RELIABILITY USING REDUNDANCY CONCEPTS a)s) 


improvement? Could it not be that a different method, possibly making use of the nature of $f_1(x)$, would give better reliabilities?


THE AVERAGING METHOD 


The reliability associated with the ‘‘off’’ state 
could be improved by increasing y (see Fig. 2), 
but this would be done at the expense of decreas- 
ing the reliability of the ‘‘on’’ state. The value of 
y is to some extent fixed and cannot be used di- 
rectly to improve the over-all reliability. 

Returning to Fig. 3, it may be seen that the reliability would increase if the probability density $f_1(x)$ could be adjusted so that less of the area under the curve lies to the right of $y$. Suppose that $n$ redundant circuits are to be used to increase the reliability, and that their outputs are $x_1, x_2, x_3, \ldots, x_n$, each following the same probability density $f_1(x)$. Let

$$s = g(x_1, x_2, x_3, \ldots, x_n) \qquad (6)$$

be a function of these outputs. $s$ will follow a probability density $f_n(s)$ which will be different from $f_1(x)$ (see Fig. 5). The reliability (probability that $s$ is less than $y$) will be

$$R_{n,g} = \int_{-\infty}^{y} f_n(s)\,ds. \qquad (7)$$


Fig. 5—Desired distribution function of s compared with x 
for the ‘‘off’’ state. 


It is conceivable that one might now have

$$R_{n,g} > R_{n,m}, \qquad (8)$$

where $R_{n,m}$ is the reliability obtained with $n$ redundant circuits by the majority method.

The problem is centered in (6): can a function $g$ be found which satisfies inequality (8)? A generalization of this would be: can one find a function $g$ such that $R_{n,g}$ in inequality (8) is maximized? An




attempt will be made to answer the first question 
by using a particular function g. The answer to 
the second question is discussed at the conclusion 
of this paper. 

Consider the function

$$s = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}, \qquad (9)$$

that is, the average of the outputs. The $x$'s, being the outputs of the redundant circuits, will in general be independent and will each obey the same probability density curve, $f_1(x)$.

A general method of finding the probability density, $f_n(s)$, of $s$ is by use of the Fourier transform. The Fourier transform of $f_1(x)$ is

$$\varphi_1(v) = \int_{-\infty}^{+\infty} f_1(x)\,e^{ivx}\,dx. \qquad (10)$$

It can be shown that the Fourier transform of the average $s$ is related to $\varphi_1(v)$ by²

$$\varphi_n(v) = [\varphi_1(v/n)]^n, \qquad (11)$$

from which, by using the inverse Fourier transform, the probability density $f_n(s)$ of $s$ may be found:

$$f_n(s) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} \varphi_n(v)\,e^{-isv}\,dv. \qquad (12)$$

A few examples will be examined to establish the merit of this averaging method.
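Relation (11) uses only the independence and identical distribution of the outputs, so it can be checked on any distribution whose average can be enumerated exactly. The sketch below uses a toy two-valued output (x = 1 with probability p, else 0), an assumption made only to keep the exact law of s finite; it is not the circuit model of the paper.

```python
import cmath
from math import comb


def cf_single(v: float, p: float) -> complex:
    """E[exp(i v x)] for the toy output: x = 1 with probability p, else 0."""
    return (1 - p) + p * cmath.exp(1j * v)


def cf_average_direct(v: float, p: float, n: int) -> complex:
    """E[exp(i v s)] with s = (x1 + ... + xn)/n, summed over the exact law of s."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) * cmath.exp(1j * v * k / n)
        for k in range(n + 1)
    )


def cf_average_eq11(v: float, p: float, n: int) -> complex:
    """Relation (11): phi_n(v) = [phi_1(v/n)]^n for independent outputs."""
    return cf_single(v / n, p) ** n


for v in (0.5, 2.0, 7.0):
    print(v, abs(cf_average_direct(v, 0.3, 5) - cf_average_eq11(v, 0.3, 5)))
```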


Normal Distribution: Let $x$ follow a normal distribution density with a mean at $x = \mu_1$ and a standard deviation $\sigma_1$, giving

$$f_1(x) = \frac{1}{\sigma_1\sqrt{2\pi}}\,e^{-(x-\mu_1)^2/2\sigma_1^2}. \qquad (13)$$

Eq. (10) gives

$$\varphi_1(v) = \frac{1}{\sigma_1\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{-(x-\mu_1)^2/2\sigma_1^2}\,e^{ivx}\,dx$$

or

$$\varphi_1(v) = e^{-(\sigma_1 v)^2/2 + i\mu_1 v}, \qquad (14)$$

from which one gets

$$\varphi_n(v) = e^{-(\sigma_1 v)^2/2n + i\mu_1 v}. \qquad (15)$$

Applying the inverse Fourier transform gives

$$f_n(s) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} e^{-(\sigma_1 v)^2/2n + i\mu_1 v}\,e^{-isv}\,dv$$

or

$$f_n(s) = \frac{1}{\sigma_n\sqrt{2\pi}}\,e^{-(s-\mu_1)^2/2\sigma_n^2}, \qquad (16)$$

where

$$\sigma_n = \sigma_1/\sqrt{n}. \qquad (17)$$

²J. V. Uspensky, ‘‘Introduction to Mathematical Probability,’’ McGraw-Hill Co., Inc., New York, N. Y.; 1937.


The average $s$ will also obey a normal distribution curve with the same mean but with a smaller standard deviation (see Fig. 6). The increase in reliability is evident.

Fig. 6—Comparison of distributions of $x$ and $s$, where $s$ is the average value of $x$.

Let us now compare this with the majority method for $n = 3$. Let $R_1 = 0.90$. A majority system will give $R_{3,m} = 0.972$. The average will give $R_{3,a} = 0.987$. Fig. 7 gives a direct comparison between the majority method ($R_{n,m}$) and the averaging method ($R_{n,a}$) for $n = 3$ and $n = 5$, respectively. It can be seen that for $R_1$ (reliability of a single circuit) larger than 0.5, the averaging method will always give better results than the majority method. For $R_1$ less than 0.5, both methods will give reliabilities smaller than $R_1$. This example dealt with an $f_1(x)$ following a normal distribution. This, of course, is not a very realistic distribution, since the output $x$ is limited, by the physical nature of the circuit, to an upper and a lower bound. In the following example we consider the more realistic case.
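For the normal case the comparison reduces to one observation: if y lies z standard deviations above the mean for a single circuit, with Φ(z) = R_1, then by (17) it lies z√n standard deviations of s above the mean after averaging. A sketch using Python's `statistics.NormalDist` reproduces the two figures quoted for R_1 = 0.90 and n = 3 (helper names are ours):

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def averaging_reliability_normal(r1: float, n: int) -> float:
    """R_{n,a} for normal outputs: y sits z*sigma_1 = z*sqrt(n)*sigma_n above the mean."""
    z = STANDARD_NORMAL.inv_cdf(r1)
    return STANDARD_NORMAL.cdf(z * n ** 0.5)


def majority_3(r1: float) -> float:
    """Eq. (3)."""
    return r1**2 * (3 - 2 * r1)


print(f"{majority_3(0.90):.3f} {averaging_reliability_normal(0.90, 3):.3f}")  # 0.972 0.987
```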


Fig. 7—Comparison of reliabilities obtained by majority method and average method for two different values of redundancy.

Pearson Distribution: Let $f_1(x)$ be represented by a Pearson's curve of type I, where $a_1$ and $a_2$ are the lower and upper limits, respectively:

$$f_1(x) = (a_2 - a_1)^{-(p_1+q_1-1)}\,\frac{(p_1+q_1-1)!}{(p_1-1)!\,(q_1-1)!}\,(x - a_1)^{p_1-1}(a_2 - x)^{q_1-1}. \qquad (18)$$

The general shape of (18) is shown in Fig. 8. Such a curve is more realistic than the normal distribution and was found to give a good fit for transistor gate circuits. The sharpness of the curve depends on the form factors, $p_1$ and $q_1$. The coefficient

$$(a_2 - a_1)^{-(p_1+q_1-1)}\,\frac{(p_1+q_1-1)!}{(p_1-1)!\,(q_1-1)!}$$

normalizes the curve so that

$$\int_{a_1}^{a_2} f_1(x)\,dx = 1. \qquad (19)$$


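This imposes a lower bound on $p_1$ and $q_1$: both must be greater than 0 for the integral in (19) to converge. (For non-integer $p_1$ and $q_1$, the factorials are read as gamma functions, $k! = \Gamma(k+1)$.) The mean occurs at $x = \mu_1$, where

$$\mu_1 = \int_{a_1}^{a_2} x f_1(x)\,dx = (q_1 a_1 + p_1 a_2)/(p_1 + q_1). \qquad (20)$$

The standard deviation $\sigma_1$ is

$$\sigma_1 = \left[\int_{a_1}^{a_2} (x - \mu_1)^2 f_1(x)\,dx\right]^{1/2} = \left[\frac{p_1 q_1}{p_1 + q_1 + 1}\right]^{1/2}\frac{a_2 - a_1}{p_1 + q_1}. \qquad (21)$$

Fig. 8—Two examples of Pearson's Type I distribution curve with limits normalized to 0 and 1: panels (a) and (b), showing the cases $p_1 < q_1$ and $p_1 > q_1$.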


If $f_1(x)$ is normalized and the $x$-coordinate is adjusted so that the limits $a_1$ and $a_2$ become 0 and 1, respectively, (18) becomes

$$f_1(x) = \frac{(p_1+q_1-1)!}{(p_1-1)!\,(q_1-1)!}\,x^{p_1-1}(1-x)^{q_1-1}. \qquad (18a)$$

This normalization does not affect the values of $p_1$ and $q_1$, but the mean and standard deviation become

$$\mu_1 = p_1/(p_1 + q_1), \qquad (20a)$$

$$\sigma_1 = \left[\frac{p_1 q_1}{p_1 + q_1 + 1}\right]^{1/2}\frac{1}{p_1 + q_1}. \qquad (21a)$$


By following the process outlined by (10) through (12), it is found that the distribution density of $s$ (the average of the outputs of $n$ redundant circuits) is again a Pearson's type I curve,

$$f_n(s) = \frac{(p_n+q_n-1)!}{(p_n-1)!\,(q_n-1)!}\,s^{p_n-1}(1-s)^{q_n-1}, \qquad (22)$$

where

$$p_n = p_1\left[n + \frac{n-1}{p_1+q_1}\right] \qquad (23)$$

and

$$q_n = q_1\left[n + \frac{n-1}{p_1+q_1}\right]. \qquad (24)$$




The new mean $\mu_n$ is the same as $\mu_1$,

$$\mu_n = p_n/(p_n + q_n) = p_1/(p_1 + q_1) = \mu_1, \qquad (25)$$

while the standard deviation is reduced by a factor of $1/\sqrt{n}$:

$$\sigma_n = \left[\frac{p_n q_n}{p_n + q_n + 1}\right]^{1/2}\frac{1}{p_n + q_n} = \sigma_1/\sqrt{n}. \qquad (26)$$

Moreover, the lower and upper limits 0 and 1 ($a_1$ and $a_2$ in the non-normalized coordinate) have remained the same. This result, conserving the mean while reducing the standard deviation, was also found for the case of the normal distribution. The decrease of the standard deviation points to a sharper distribution density, and in general higher reliabilities are to be expected.
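Parameters (23) and (24) scale $p_1$ and $q_1$ by the same factor, which is why the mean survives and the variance drops by exactly $n$. The sketch below checks (25) and (26) numerically; the values $p_1 = 3$, $q_1 = 8.9$ are borrowed from the transistor-gate example discussed later in the text, and the helper names are ours.

```python
import math


def averaged_form_factors(p1: float, q1: float, n: int) -> tuple[float, float]:
    """p_n and q_n of eqs. (23)-(24) for the average of n outputs."""
    k = n + (n - 1) / (p1 + q1)
    return p1 * k, q1 * k


def mean_and_sd(p: float, q: float) -> tuple[float, float]:
    """Mean and standard deviation of the normalized type I curve, eqs. (20a)-(21a)."""
    mean = p / (p + q)
    sd = math.sqrt(p * q / (p + q + 1)) / (p + q)
    return mean, sd


p1, q1, n = 3.0, 8.9, 3
pn, qn = averaged_form_factors(p1, q1, n)
mu1, sigma1 = mean_and_sd(p1, q1)
mun, sigman = mean_and_sd(pn, qn)
print(f"p_n = {pn:.3f}, q_n = {qn:.3f}")
print(f"mean {mu1:.4f} -> {mun:.4f}, sigma ratio {sigma1 / sigman:.4f} (sqrt(3) = {math.sqrt(3):.4f})")
```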

A comparison with the majority scheme will now be made. Let us first examine the special case where $f_1(x)$ is a constant for one of the distributions. This is not very realistic, but it offers an indication of the advantages of the averaging over the majority method. For this distribution, we must have

$$f_1(x) = 1, \quad 0 \le x \le 1, \qquad (27)$$

$$p_1 = q_1 = 1. \qquad (27a)$$

The probability (reliability) of $x$ less than $y$ is

$$R'_1 = \int_0^y f_1(x)\,dx = y. \qquad (28)$$

If three ($n = 3$) redundant circuits are used, the distribution of the average will be

$$f_3(s) = \frac{7!}{3!\,3!}\,s^3(1-s)^3, \qquad (29)$$

which results from the fact that $p_3 = q_3 = 4$. This gives

$$f_3(s) = 140\,s^3(1-s)^3,$$

and the new probability (reliability) of $s$ less than $y$ will be

$$R'_{3,a} = \int_0^y f_3(s)\,ds = R_1'^4(35 - 84R'_1 + 70R_1'^2 - 20R_1'^3). \qquad (30)$$


Comparing this with (5), we see that, for this case,

$$R'_{3,a} = R_{7,m}. \qquad (31)$$

In other words, for the case of $p_1$ and $q_1$ equal to 1, the averaging method with three redundant circuits gives the same result as the majority method with seven redundant circuits. It is easy to show that, in general, for $p_1 = q_1 = 1$,

$$R'_{n,a} = R_{3n-2,m}, \qquad (32)$$

and since for $n$ greater than 1, $3n - 2$ is greater than $n$ (and the majority reliability increases with the number of circuits when $R_1 > 0.5$),

$$R'_{n,a} > R_{n,m}. \qquad (33)$$

Eq. (32) may also be written as

$$R_{n,m} = R'_{(n+2)/3,a}, \qquad (32a)$$

which states that the majority method with $n$ redundant circuits will give the same reliability as the averaging method with $(n+2)/3$ redundant circuits and $p_1$ and $q_1$ equal to 1. In other words,


$$R_{n,m} = \frac{n!}{\left[\left(\frac{n-1}{2}\right)!\right]^2}\int_0^{R_1} x^{(n-1)/2}(1-x)^{(n-1)/2}\,dx. \qquad (34)$$

Fig. 9—Special case of distribution curve, used for comparison between averaging and majority methods. Curve on the right shows possible distribution of ‘‘on’’ signals that might occur.
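The uniform-case identities can be verified numerically. This sketch integrates $f_3(s) = 140\,s^3(1-s)^3$ from (29) with Simpson's rule (our choice of quadrature) and compares the result with the seven-circuit majority reliability; it also evaluates (34), read as $R_{n,m} = n!/[((n-1)/2)!]^2 \int_0^{R_1} x^{(n-1)/2}(1-x)^{(n-1)/2}\,dx$, against the binomial sum. Helper names are ours.

```python
from math import comb, factorial


def simpson(f, a: float, b: float, steps: int = 2000) -> float:
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def majority_reliability(r1: float, n: int) -> float:
    """Binomial form: at least (n + 1)/2 of n independent circuits correct (n odd)."""
    return sum(comb(n, k) * r1**k * (1 - r1) ** (n - k) for k in range((n + 1) // 2, n + 1))


def majority_via_eq34(r1: float, n: int) -> float:
    """Eq. (34): the same quantity as an incomplete beta integral."""
    half = (n - 1) // 2
    coeff = factorial(n) / factorial(half) ** 2
    return coeff * simpson(lambda x: x**half * (1 - x) ** half, 0.0, r1)


def averaging_uniform_n3(y: float) -> float:
    """Eqs. (29)-(30): R'_{3,a}, integrating f_3(s) = 140 s^3 (1 - s)^3 up to y."""
    return simpson(lambda s: 140 * s**3 * (1 - s) ** 3, 0.0, y)


for y in (0.6, 0.8, 0.95):
    print(y, averaging_uniform_n3(y), majority_reliability(y, 7))
```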


If $p_1$ and $q_1$ are larger than unity, the averaging method will give still higher reliabilities:

$$R_{n,a} > R'_{n,a}. \qquad (35)$$

We conclude that

$$R_{n,a} > R_{n,m}. \qquad (36)$$

Let us consider the case where $p_1 \ne q_1$. Since the choice of the labeling of ‘‘off’’ and ‘‘on’’ is arbitrary, we may always state that one would correspond to $p_1 > q_1$ and the other to $p_1 < q_1$ (see Fig. 10). Furthermore, if the state with $p_1 > q_1$ is normalized to the same coordinate as the one with $p_1 < q_1$ (shifting of upper and lower limits), we may consider both ‘‘off’’ and ‘‘on’’ states with $p_1 < q_1$ (see Fig. 11). This has actually been found to be the case for transistor gates.

Fig. 10—Representative normalized distribution.

For simplicity, let us assume that, when normalized to the same coordinate, both ‘‘off’’ and ‘‘on’’ states present the same $p_1$ and $q_1$. This suggests that the dividing line $x = y$ should be placed at $y = 0.5$, as in Fig. 12. The single-circuit reliability is

$$R_1 = \frac{(p_1+q_1-1)!}{(p_1-1)!\,(q_1-1)!}\int_0^{0.5} x^{p_1-1}(1-x)^{q_1-1}\,dx. \qquad (37)$$

The averaging method with $n$ redundant circuits will yield a reliability

$$R_{n,a} = \frac{(p_n+q_n-1)!}{(p_n-1)!\,(q_n-1)!}\int_0^{0.5} s^{p_n-1}(1-s)^{q_n-1}\,ds, \qquad (38)$$

where $p_n$ and $q_n$ are given by (23) and (24).
The majority method with $n$ redundant circuits will yield a reliability $R_{n,m}$, given by (34).

Fig. 11—‘‘On’’ curve normalized to same limits as ‘‘off’’ curve.

Fig. 12—Identical ‘‘on’’ and ‘‘off’’ distributions used for example in text.
Figs. 13 and 14 demonstrate the advantage of the averaging over the majority method for $p_1 < q_1$ and three ($n = 3$) redundant circuits. Fig. 13 compares the number of errors per 1,000 operations of a single circuit with those to be expected from the majority and the averaging methods. For example, for $p_1 = 3$ and $q_1 = 8.9$, a single circuit will give an erroneous output about 30 times in 1,000 operations. The majority method would give about four erroneous outputs and the averaging about one. Fig. 14 gives the ratio of the number of erroneous outputs by majority to the number by averaging. For example, for $p_1 = 2.8$ and $q_1 = 9$, the majority method will produce five times as many errors as the averaging method.

Fig. 13—Expected errors per thousand operations: a) single circuit; b) majority method with $n$ of 3; c) averaging method with $n$ of 3.

Fig. 14—Ratio of errors expected per thousand operations using the majority method to those expected for the averaging method.
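The error counts quoted for $p_1 = 3$, $q_1 = 8.9$ can be approximated with a sketch like the following. It assumes the dividing line at $y = 0.5$, the three-circuit majority formula (3), and the averaged form factors (23) and (24); factorials are replaced by gamma functions (via `math.lgamma`) because the form factors need not be integers, and the tail integrals use Simpson's rule. It is an independent recomputation, not the authors' calculation, so the results need not match the figures exactly.

```python
import math


def upper_tail(p: float, q: float, lower: float = 0.5, steps: int = 4000) -> float:
    """P(x > lower) for the normalized type I curve (18a), by Simpson's rule.

    Assumes q > 1, so the density vanishes at x = 1.
    """
    log_norm = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)

    def density(x: float) -> float:
        if x <= 0.0 or x >= 1.0:
            return 0.0
        return math.exp(log_norm + (p - 1) * math.log(x) + (q - 1) * math.log1p(-x))

    h = (1.0 - lower) / steps
    total = density(lower) + density(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(lower + i * h)
    return total * h / 3


def errors_per_thousand(p1: float, q1: float) -> tuple[float, float, float]:
    """(single, majority, averaging) expected errors per 1,000 operations, n = 3."""
    n = 3
    e_single = upper_tail(p1, q1)               # 1 - R_1
    r1 = 1.0 - e_single
    e_majority = 1.0 - r1**2 * (3 - 2 * r1)     # 1 - R_{3,m}, eq. (3)
    k = n + (n - 1) / (p1 + q1)                 # eqs. (23)-(24)
    e_average = upper_tail(p1 * k, q1 * k)      # 1 - R_{3,a}, eq. (38)
    return 1000 * e_single, 1000 * e_majority, 1000 * e_average


single, majority, average = errors_per_thousand(3.0, 8.9)
print(f"single {single:.1f}, majority {majority:.1f}, averaging {average:.1f}")
```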


CONCLUSIONS 


It has been demonstrated that the reliability of a system may be increased by the use of redundant elements. This makes possible the use of inexpensive components, known to be less reliable, in place of the more reliable and more expensive ones now used. Furthermore, it has been shown that redundancy by the averaging method will generally give better reliability improvement than the majority method. In addition, it is usually simpler to construct an averaging circuit (e.g., output currents through a common load) than a majority circuit.


Fig. 15—Limits of distribution function, $f_n(s)$, as the redundancy is increased.


Although the averaging method will yield good results, it is still probably not the best. The goal would be to find functions $s$ (6) such that the distribution densities of the ‘‘off’’ and ‘‘on’’ states approach (as $n$ is increased) the two Dirac functions $\delta(s - a)$ and $\delta(s - b)$ (see Fig. 15).

The particular function $s$ to be chosen will, in general, depend on the distribution density of the particular single circuit to be used. For example, in some cases a weighted average might prove advantageous, the weighting factor depending on the nature of $f_1(x)$.


ACKNOWLEDGMENT 


The authors would like to acknowledge the aid 
of Mrs. Louis Depian and of G. Uyehara in the 
preparation of this manuscript, and of L. Rubin 
in calculations required to construct the graphs 
shown in Figs. 14 and 15. 




HOW CAN WE ATTAIN HIGH RELIABILITY OF COMPLEX 
MILITARY ELECTRONIC EQUIPMENT?* 


MORRIS HALIO†


Summary—Each piece of military electronic 
equipment passes through various phases in its 
normal life cycle. These are planning, design and 
development, pilot production, manufacture, trans- 
portation, storage, operation and maintenance. 
Each of these stages is replete with opportunities 
for the introduction of unreliabilities. This paper 
points out the pitfalls which may be encountered 
and makes specific recommendations to avoid 
these so that the full amount of potential reliabil- 
ity may be realized in the final equipment. 


By this time, most of us have been made aware 
of the growing complexity of weapon systems 
utilizing countless electronic circuits with myr- 
iads of parts and the terrifying reliability prob- 
lems that arise as a consequence thereof. There- 
fore, let us proceed immediately to the crux of 
the matter—namely, what can we do to remedy the 
situation? The reliability problem with its many 
facets is reminiscent of the many-headed hydra. 
Each of these heads must be removed to conquer 
the beast. If we were to trace a piece of equipment through its life cycle, we might arrive at a flow chart such as that shown in Fig. 1, including such factors as planning, design and development, pilot production, manufacturing, transportation, storage, and operation and maintenance. Each of these stages presents an opportunity for additional unreliabilities to be introduced. It is obvious that if the design is such as to limit the maximum potential reliability to a certain value, then poor manufacturing processes or the deleterious effects of storage, for example, can only serve to reduce the ultimate reliability of the equipment. It is therefore imperative to minimize the unreliabilities introduced by each step in the process.

Fig. 1—Flow chart in life cycle of equipment. (Stages: planning; design and development; pilot production; manufacture; transportation; storage; operation and maintenance.)

*This paper was presented at the Fifth Joint Military-Industry Symp. on Guided Missile Reliability, Chicago, Ill., December 8-10, 1958.

†Headquarters Air Defense Command, USAF, Colorado Springs, Colo.; formerly at Ballistic Res. Labs., Aberdeen Proving Ground, Md.

At this point, some of the terms used in this paper can be defined. The first one, of course, is ‘‘reliability.’’ Definitions of this term vary from some long and complicated ones to the simple one, ‘‘When you press the button, it goes.’’ This author prefers the definition employed by one of the task groups of AGREE (Advisory Group on Reliability of Electronic Equipment of the Office of the Assistant Secretary of Defense). This is: ‘‘Reliability of an item is the probability that it will perform without failure a specified function under specified test conditions for a required period of time.’’ Incidentally, the various task groups of AGREE did not all agree on a definition for this term.


Mathematically, $R(t) = e^{-t/m}$, where $R(t)$ is the reliability, $t$ is the variable time, and $m$ is referred to as the reliability index. The latter is defined as the average measure of the equipment failure rate expressed as mean time between failures. The reciprocal of this quantity is known as the failure rate and is most conveniently expressed as number of failures per thousand hours.
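With m the mean time between failures, the exponential law R(t) = exp(−t/m) and the failure rate 1/m per thousand hours are one-liners. A minimal sketch; the 200-hour MTBF is a hypothetical value, not a figure from the paper, and the function names are ours.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """R(t) = exp(-t/m) for a constant failure rate, m = mean time between failures."""
    return math.exp(-t_hours / mtbf_hours)


def failures_per_thousand_hours(mtbf_hours: float) -> float:
    """Failure rate 1/m, expressed per 1000 hours."""
    return 1000.0 / mtbf_hours


m = 200.0  # hypothetical MTBF in hours
print(reliability(10.0, m), failures_per_thousand_hours(m))
```

Operating for one full MTBF leaves R = exp(−1), about 0.37, which is why m alone overstates how long equipment can be trusted.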

Fig. 2 depicts a typical statistical curve of the 
variation of failure rate during the life of an 
equipment. The high rate of early failures is 
attributable to poor parts control, manufacturing 
techniques, inspection and quality control. At 
time A, the defective parts have been eliminated 
and the failure rate is essentially constant until 
time B, when the failure rate begins to increase, 
signifying the end of useful life of the equipment. 

The terms employed for the various subdivi- 
sions of an equipment are still not fully stand- 
ardized; therefore, this author would like to 
recommend the following definitions which are 
modifications of those listed in DOD Directive 
3232.2. 


Fig. 2—Failure rate of equipment vs time. (Regions of the curve: early failure period, normal operating period, wear-out period.)


Part: An item which cannot be disassembled 
without destroying its identity; e.g., resistor, 
capacitor, switch, relay, socket, bearing, bolt. 

Subassembly: An aggregation of parts mounted together for convenience and incapable of performing any function prior to being incorporated into an assembly; e.g., a terminal board with parts mounted on it, an IF transformer with tuning slug and mechanism.

Assembly: A combination of parts or subassemblies or both capable of performing a function; e.g., amplifier, oscillator, modulator, filter, power supply, junction box.

Component: An aggregation of assemblies, constituting an element of an equipment and performing a function necessary to the operation of that equipment; e.g., transmitter, receiver, rotating antenna, frequency standard.

Equipment: A group of components capable of performing a specified function; e.g., a radar set, a gun director.

You will notice that the subdivision formerly known as ‘‘component’’ is now referred to as ‘‘part,’’ while the term ‘‘component’’ is reserved for designating a group of assemblies.


PLANNING 


The first phase of the reliability program is the 
planning stage. It is necessary that quantitative 




specifications for equipment reliability be incor- 
porated into the development contract. The present 
low level of reliability may be partly ascribed to 
failure to do so. In the past, a manufacturer who 
designed a new system has had to meet certain 
performance specifications. However, he has been 
under no legal obligation to include reliability 
among these. As a result, reliability has been 
treated as an afterthought. Long experience has 
shown that this is too late to improve reliability. 
Once the design has been frozen, the failure rate 
of electronic equipment cannot be appreciably de- 
creased by debugging. High reliability cannot be 
achieved unless this factor is taken into account 
during the preceding stages. 

Reliability requirements should originate with 
the groups responsible for the operational require- 
ments and military characteristics of the various 
services, since it is through these groups that the 
services must determine how they intend to ac- 
complish their mission. These figures must then 
be incorporated into the development contracts for 
new equipment. Proper planning is the foundation 
on which the reliability structure is based. 


DESIGN AND DEVELOPMENT 


Design and development follow the planning 
stage. The reliability of the completed equipment 
will depend on that of the parts employed as well 
as the circuitry in which they are utilized. It is 
well known that the over-all reliability of an 
equipment where the parts are placed in series¹ can be expressed by

$$R_{\text{over-all}} = R_1 \times R_2 \times R_3 \times \cdots,$$

i.e., the over-all reliability is the product of the individual reliabilities. The simplifying assumption has been made that there are no reliability interactions among the various parts. Evidently, for very complex equipments, the reliabilities of individual parts must be extremely high if the over-all reliability is to be tolerable. Fig. 3, which has been adapted from Lusser [3], shows the relation of over-all reliability to individual reliabilities for various degrees of equipment complexity. For simplicity, the individual reliabilities have been made equal. Notice, for example, that an equipment of 400 parts, each 99 per cent reliable, has only a 2 per cent over-all reliability. This emphasizes what is probably the most important concept in the study of reliability, namely, that individual parts of a complex equipment must be of the very highest reliability. This means that the margins between the strengths and stresses must be sufficiently large. By stresses, we are referring not only to mechanical forces, but to other parameters such as voltage, current, frequency, temperature, humidity, acceleration, vibration, etc., to which a part is subjected in use. The strengths are the values of these parameters at which failure will occur under the given conditions.

¹A part is a series part if its failure would cause the entire equipment to fail. It is a parallel part if its failure would not necessarily lead to failure of the equipment, since it is shunted by another part.
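The product rule and the 400-part example can be sketched in a few lines. The second helper inverts the rule to show how reliable each part must be for a chosen over-all target; the 90 per cent target is a hypothetical value for illustration, and both function names are ours.

```python
import math


def overall_reliability(part_reliabilities: list[float]) -> float:
    """Series equipment: every part must work; parts assumed not to interact."""
    return math.prod(part_reliabilities)


def required_part_reliability(target: float, parts: int) -> float:
    """Equal per-part reliability needed to reach a target over-all reliability."""
    return target ** (1.0 / parts)


print(overall_reliability([0.99] * 400))     # about 0.018, the "2 per cent" of the text
print(required_part_reliability(0.90, 400))  # per-part reliability for a 90 per cent target
```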


Fig. 3—Over-all reliability as a function of complexity and reliability of parts. (Axes: reliability of each part, per cent; over-all reliability of equipment, per cent.)


To determine whether a part to be used in an 
equipment is of acceptable reliability, a stress- 
strength analysis is recommended. The following 
procedure is employed. A stress scatter diagram 
is constructed as in Fig. 4, depicting the stresses 
to which the part will be subjected in the intended 
application. These data will have been obtained 
from field measurements. A frequency distribu- 
tion curve is drawn and the mean and standard 
deviation calculated. Tests-to-failure are then 
conducted on a representative sample of the part 
whose use is contemplated and the strengths 
plotted. The frequency distribution curve, the 


HALIO: HIGH RELIABILITY OF ELECTRONIC EQUIPMENT 63 


mean and the standard deviation for the strengths 
are obtained. To determine the allowable margin 
between the mean stress and mean strength, the 
standard deviations of stress and strength are 
multiplied by suitable factors depending on the re- 
quired part reliability and the products are added. 


Fig. 4—Distribution curves of strengths and stresses. (Labels: scatter band of test-to-failure data; scatter band of stresses under field conditions; safety margin between them. Axes: position number in sequence of tests vs stresses and strengths.)


Thus, the total permissible margin between the strength and stress means can be expressed as

$$M = K_1 S_1 + K_2 S_2,$$

where $M$ is the margin, $K_1$ and $K_2$ are the strength and stress factors, and $S_1$ and $S_2$ are the corresponding standard deviations. If the actual margin is less than the permissible margin, the part will have to be redesigned or replaced with a more reliable part. Stress-strength analysis of this type is of extreme importance in the effort to attain high reliability.
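A sketch of the margin test, assuming each scatter band is summarized by a mean and a standard deviation. The subscripts (1 for strength, 2 for stress), the factors of 3.0 and the voltage data are all hypothetical; the text leaves the choice of factors to the required part reliability.

```python
from dataclasses import dataclass


@dataclass
class Scatter:
    """Mean and standard deviation of a measured distribution."""
    mean: float
    sd: float


def permissible_margin(strength: Scatter, stress: Scatter, k_strength: float, k_stress: float) -> float:
    """M = K1*S1 + K2*S2: each standard deviation times its factor, summed."""
    return k_strength * strength.sd + k_stress * stress.sd


def part_is_acceptable(strength: Scatter, stress: Scatter, k_strength: float, k_stress: float) -> bool:
    """True if the actual margin between the means meets the permissible margin."""
    actual_margin = strength.mean - stress.mean
    return actual_margin >= permissible_margin(strength, stress, k_strength, k_stress)


# Hypothetical voltage data for one part type
strength = Scatter(mean=100.0, sd=5.0)
field_stress = Scatter(mean=60.0, sd=4.0)
print(permissible_margin(strength, field_stress, 3.0, 3.0))  # 27.0
print(part_is_acceptable(strength, field_stress, 3.0, 3.0))  # True
```

A part failing the test (for example, field stress averaging 80 V against the same strength data) would be redesigned or replaced, as the text prescribes.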

It would be extremely desirable to standardize 
parts of high strength and to have this informa- 
tion on these parts assembled in handbooks avail- 
able to designers. A start has been made in this 
direction, but the trend will have to be greatly 
accelerated to meet the needs of the military. 
Vitro Corporation, RCA and Inland Testing 
Laboratories are among those who are doing 
pioneering work in this field. 


Sub-assemblies, assemblies and components 
should also be subjected to tests-to-failure. 
However, the purpose of testing these is to dis- 
cover failures caused by specific assembly effects, 




such as local resonances and ambient tempera- 
tures. Therefore testing of these items is recom- 
mended only after it has been determined that 
parts of extremely high reliability have been em- 
ployed under conditions of adequate margin of 
safety. Otherwise, this type of testing becomes
very cumbersome and failures attributable to
part unreliability mask those caused by assembly
effects.

Stress-strength analysis depends upon testing 
of parts to failure as contrasted to testing of com- 
plete equipments under operating conditions. Un- 
fortunately there has been too much reliance on 
the latter procedure as a means of seeking the 
achievement of reliability. This is to be deplored, 
since testing-to-failure furnishes a much better 
means of attaining this goal. For one thing, it 
makes it possible to determine very quickly the 
modes of failure and permit redesign so that 
reliable parts can be used in the equipment. The 
old bug-hunting methods depending on failure re- 
porting of equipments tested under normal oper- 
ating conditions would take forever and a day to 
accomplish the desired result. In addition, 
testing-to-failure is far cheaper, since this 
method requires substantially fewer tests. Ex- 
tensive flight testing of missiles, for example, can 
get to be rather expensive. Even then, the ulti- 
mate cause of failure is often not discovered. Or 
to put it another way, testing-to-failure means 
that we can buy much more reliability for a fixed 
amount of money. 

One of the ways in which part reliability may 
be improved is for the parts designers to refrain 
from designing universal parts. Design of a 
single part for several applications with widely 
differing specifications tends to make the relia- 
bility for each application lower than if a different 
type were built for each of these. A part is gen- 
erally designed for universality of application for 
two reasons. One is lower cost because of high 
quantity of production. The other is that the con- 
trol processes that accompany mass production 
tend to improve the quality and consequently the 
reliability of the product. However, there is a 
certain level of production beyond which the 
quality remains essentially constant. Once this is 
reached, the faults of the multiplicity of functions 
of a universal part become evident; e.g., in the 
case of electron tubes, a tube may be used in a 
dc amplifier or in a blocking oscillator. Clearly,
the specifications are different for these applica- 
tions. By designing a tube which is applicable to 
both of these uses, the reliability for each suffers. 
There is certainly sufficient demand for each type 
so that a different tube can be built for each appli- 
cation. It is therefore recommended that parts 


APRIL 


designers originate different types for widely 
varying uses. 

Another step the circuit designer can take to 
maximize reliability is to select part types which 
have higher inherent reliability. Semiconductors 
can be used in place of electron tubes, vacuum 
relays instead of other types, etc. 

The total effect of parts tolerances plus drift 
due to aging may cause failure of a circuit in 
operation; e.g., an oscillator may shift its fre- 
quency out of tolerance or may stop oscillating 
entirely; a flip-flop may reach such a condition 
that a prescribed pulse may fail to trigger it. 

To prevent such an occurrence, a design method 
known as marginal checking (developed by Lincoln 
Laboratory) is recommended. In this, the allow- 
able variation of a part is determined as a func- 
tion of a selected circuit parameter, usually a 
supply voltage. In practice, the tolerance of one 
of the parts in the circuit is plotted against the 
variation in this marginal-checking parameter, as 
shown in Fig. 5. For various values of part de- 
viation, the supply voltage is varied until the cir- 
cuit fails to perform according to specifications. 
The locus of failure points separates the failure 
region from that of normal operation. In this 
manner, not only can the proper design center 
value be determined but the allowable tolerances 
as well. Universal employment of marginal 
checking by equipment designers is decidedly 
recommended. 
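The sweep described above can be sketched in a few lines. This is a minimal illustration under stated assumptions and is not the Lincoln Laboratory procedure itself. `toy_circuit` is a hypothetical stand-in for the bench circuit, and its 9 to 14 limits, the 12 V nominal supply and the 0.5 V step are all invented. For each part deviation, the supply is stepped down and then up from nominal until the circuit first fails. The resulting pairs of failure voltages trace the locus.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# works(deviation_pct, supply_v) -> True if the circuit meets its specification
CircuitModel = Callable[[float, float], bool]


def failure_voltage(works: CircuitModel, deviation_pct: float, nominal_v: float,
                    step_v: float, direction: int, max_steps: int) -> Optional[float]:
    """Step the supply away from nominal (direction -1 = down, +1 = up) and
    return the first voltage at which the circuit fails, or None if it never does."""
    for k in range(max_steps + 1):
        v = nominal_v + direction * k * step_v
        if not works(deviation_pct, v):
            return v
    return None


def failure_locus(works: CircuitModel, deviations: Iterable[float], nominal_v: float,
                  step_v: float = 0.1, max_steps: int = 200
                  ) -> Dict[float, Tuple[Optional[float], Optional[float]]]:
    """Map each part deviation to its (lower, upper) failure voltages."""
    return {d: (failure_voltage(works, d, nominal_v, step_v, -1, max_steps),
                failure_voltage(works, d, nominal_v, step_v, +1, max_steps))
            for d in deviations}


# Hypothetical circuit: performs to specification while the effective drive
# supply_v * (1 + deviation/100) stays between 9 and 14.
def toy_circuit(deviation_pct: float, supply_v: float) -> bool:
    drive = supply_v * (1 + deviation_pct / 100)
    return 9.0 <= drive <= 14.0


locus = failure_locus(toy_circuit, [-10, 0, 10], nominal_v=12.0, step_v=0.5, max_steps=20)
for dev, (low, high) in locus.items():
    print(f"{dev:+4d}%: first failure at {low} V going down, {high} V going up")
```

Plotting the two failure voltages against part deviation outlines the region of normal operation. The design center and the allowable part tolerance can then be read from where that region is widest around the chosen supply voltage.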

The foregoing discussion has been concerned 
with the reliability of parts; some of the principles 


Fig. 5. Marginal checking: locus of failure points. [Plot of the failure locus; horizontal axis: percent deviation of part; the region of satisfactory performance is labeled "Normal operation region."]


1960 


involved in the integration of these parts to form 
reliable equipment can be enumerated. 

The first and most obvious precaution is to keep 
it simple. Granting that a given equipment will 
require a minimum degree of sophistication, the 
fact remains that there is still plenty of oppor- 
tunity to gild the lily. The temptation is great 
for our bright, inventive designers to emulate the 
Rube Goldberg approach; this author, having done design work, fully sympathizes with them and
realizes that designing for performance is much 
more interesting and glamorous than designing 
for reliability. However, the latter is one job 
that cannot be bypassed. 

Equipment should not only be simple in design, 
but simple to operate. One of the causes of 
equipment unreliability is the maladjustment of 
controls because of the excessive number of 
front-panel adjustments which require an engineer’s training to be set correctly. This is due to a design tendency to include controls
which, when properly adjusted, increase equip- 
ment performance levels somewhat, but when 
maladjusted, reduce the equipment function to 
almost inoperable levels. 

In addition to operability, the equipment should 
be designed for a high level of maintainability. 
The latter is defined as the reciprocal of the mean net time to repair failures. Expressed mathematically,

$$\text{Maintainability} = \frac{1}{\bar{T}_r},$$

where $\bar{T}_r$ is the mean net time to repair.
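As a small worked example of this definition, the sketch below computes maintainability from a list of net repair times. The repair times are hypothetical, and so is the comparison between a wired chassis and plug-in modules. They are chosen only to show how modular construction raises the figure.

```python
from statistics import mean
from typing import Sequence


def maintainability(repair_hours: Sequence[float]) -> float:
    """Reciprocal of the mean net time to repair, in repairs per hour."""
    if not repair_hours:
        raise ValueError("at least one repair time is required")
    return 1.0 / mean(repair_hours)


# Hypothetical net repair times (hours) logged for two designs of one equipment
wired_chassis = [3.0, 5.0, 4.0, 6.0]        # mean 4.5 h
plug_in_modules = [0.5, 1.0, 0.75, 0.75]    # mean 0.75 h

print(round(maintainability(wired_chassis), 3))
print(round(maintainability(plug_in_modules), 3))
```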


Because of the increasing complexity of equipment and the decreasing quality level of service personnel, it is necessary to make equipment easy to maintain. This presupposes the adoption by the designer of a disposal-at-failure maintenance philosophy, in which circuits are designed as modules for ease of troubleshooting and replacement. Printed circuitry, encapsulation and miniaturization are all in keeping with this philosophy. It is recommended that specific maintainability requirements be included in development contracts for equipment.

Another important principle is conservatism in electronic circuit design. Parts should be derated, and tube voltages should be selected so that the lowest values which give the required performance will be employed. The latter step improves reliability in several ways. Part failure is minimized because of reduced peak currents, lowered potential stress and decreased heat dissipation. The likelihood of avalanche failure is greatly reduced. In addition, the consequently restricted energy level reduces the incidence


HALIO: HIGH RELIABILITY OF ELECTRONIC EQUIPMENT 65 


of parasitic oscillations. One manufacturer of television receivers comes to mind that was notorious for designing for stresses well above the reliability limits in order to save on parts. Not only was the reliability of the receivers extremely low, but the maintainability was so
poor that most servicemen were extremely re- 
luctant to work on them. This approach is one 
certainly not to be recommended for military 
equipment. 

Redundancy is one method employed as a re- 
liability measure. Two or more identical parts 
are placed in parallel so that failure of one part 
will not make the equipment inoperative. This is 
akin to moving the pitcher, shortstop, second
baseman and centerfielder all into line to field a 
ground ball or to the use of both suspenders and a 
belt to keep one’s trousers from falling. Redun- 
dancy is a necessary evil and is recommended 
only for critical parts where every effort to 
achieve the required part reliability has failed. 
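The gain from placing identical parts in parallel can be estimated with the elementary formula R = 1 − (1 − r)^n for n parts of reliability r. The sketch assumes that failures are independent and that one surviving part is sufficient. These are simplifying assumptions of this example rather than statements from the article.

```python
def parallel_reliability(part_reliability: float, copies: int) -> float:
    """Probability that at least one of `copies` identical parts survives.

    Assumes independent failures and that any single surviving part keeps the
    function working (no shorting failure mode that disables the whole group).
    """
    if not 0.0 <= part_reliability <= 1.0:
        raise ValueError("part_reliability must lie in [0, 1]")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - part_reliability) ** copies


for n in (1, 2, 3):
    print(n, "part(s):", round(parallel_reliability(0.90, n), 4))
```

In this hypothetical case, going from one part to two raises reliability from 0.9 to 0.99, but it doubles the part count, weight and expense. The independence assumption is also optimistic whenever a single failed part, such as a shorted capacitor, can disable the whole group.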

Reliability considerations require that the 
parts in a circuit be integrated into a package 
which is designed with a view towards optimizing 
ruggedness and thermal adequacy. With respect 
to ruggedness, the design should be such as to 
restrict the maximum vibrational transmissibility 
(transmissibility is the ratio of induced to applied 
vibration amplitude) to a value as near to unity as 
possible. For example, the use of the clamped- 
clamped type of assembly, where mounting boards 
are clamped at both ends, rather than the canti- 
lever type, is recommended. The basic principle 
of adequate thermal design is to make the total 
equivalent thermal resistances from all heat 
generating parts to the thermal sink or environ- 
ment as low as possible. That is, adequate con- 
duction, convection and radiation paths should be 
provided to dissipate the heat. In most circuits, 
the electron tubes are the principal heat-gener- 
ating parts and their operating temperatures 
generally exceed the permissible operating tem- 
peratures of the other parts. Therefore, thermal 
adequacy begins with tube location. It is desirable 
to locate the tubes as far as possible from the 
parts having the lowest permissible operating 
temperatures. Employment of equivalent thermal 
circuit diagrams in a paper analysis is of great 
assistance in minimizing cut-and-try methods in 
design. Proper packaging for ruggedness and 
thermal adequacy is a very important step toward 
equipment reliability. 
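An equivalent thermal circuit can be analyzed on paper exactly like a resistor network. Thermal resistances in series add, parallel heat paths combine reciprocally, and the temperature rise equals the dissipated power times the total resistance. The sketch below assumes steady-state conditions and lumped resistances, and all numerical values in it are hypothetical.

```python
def series(*resistances: float) -> float:
    """Thermal resistances in series (deg C per watt) add directly."""
    return sum(resistances)


def parallel(*resistances: float) -> float:
    """Parallel heat paths combine like parallel electrical resistors."""
    return 1.0 / sum(1.0 / r for r in resistances)


def part_temperature(power_w: float, r_total_c_per_w: float, ambient_c: float) -> float:
    """Steady-state temperature: ambient plus power times total thermal resistance."""
    return ambient_c + power_w * r_total_c_per_w


# Hypothetical values: a tube dissipating 5 W, 4 deg C/W from tube to chassis, then
# two parallel paths from chassis to ambient (conduction 6 deg C/W, convection 12 deg C/W).
r_total = series(4.0, parallel(6.0, 12.0))
t_tube = part_temperature(5.0, r_total, ambient_c=25.0)
print(f"total R = {r_total:.1f} deg C/W, tube temperature = {t_tube:.1f} deg C")
```

Lowering any series element or adding another parallel path lowers the computed temperature. Comparing such alternatives on paper is the kind of analysis that can reduce cut-and-try work.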


Employment of standardized electronic circuitry 
by equipment designers can be very effective in 
achievement of high equipment reliability. The 
National Bureau of Standards and the Navy Electronics Laboratory have designed a variety of electronic circuits with the emphasis on a high order of reliability. These have been published in the “NBS Preferred Circuits Handbook” and the “NEL Reliability Handbook.” A recent study of 83 pieces of Navy electronic equipment showed that 30 per cent of the circuitry could be implemented with the circuits listed in these handbooks. It is
recommended that the development of preferred 
circuits be extended and that equipment designers 
get into the habit of using these as much as possible. This may be a blow to the pride of designers who make a fetish of originality, but it will also be a blow struck against unreliability in their equipment. Standard assemblies that can be used in many equipments yield a further gain in reliability because of the improved quality control which accompanies higher production levels.

Proper liaison is an important factor and its 
omission can contribute to unreliability. Liaison 
between designer and user is desirable to acquaint 
the designer with the user’s environmental, oper- 
ating and maintenance problems. This is much 
more important with military than with commer- 
cial equipment since, in the case of military 
equipment, a specific number of equipments are 
contracted for and manufactured before there is 
feedback from the user to the producer informing 
the latter of equipment shortcomings. In the case of commercial equipment, by contrast, feedback begins with the first shipments of equipment, so that design weaknesses can be corrected before large-scale production takes place. This author recalls once having to redesign a piece of equipment after pilot production had begun, simply because the designer had not been aware of its operating conditions; as a result, the equipment proved to be unreliable for the intended application. Proper liaison with the user would have obviated the difficulty.

Liaison among the various groups involved in 
the development of an equipment is important, too. 
RCA uses an elaborate system to ensure maximum 
reliability. After the development contract is 
awarded, the design engineer must justify his ideas 
before a panel of experts—reliability engineers, 
parts people, specialists in shock, vibration and 
heat, circuit designers, etc. This is done before 
the design is started and also after the breadboard 
is ready. When the model is constructed, it is 
thoroughly tested, the results being reviewed by 
experts and analyzed in terms of the whole system. 
Weak points and lack of reliability are spotted. 
Undesirable interaction effects between various 
components of the equipment are eliminated. If 
found necessary, other tests are recommended, 
circuits are modified, packaging is changed. In 
the end, the equipment functions according to 
specified requirements. All of this review may seem unnecessarily time-consuming, but the procedure produces very large savings in reengineering costs and, what is more important, results in a highly reliable product. Emulation of this philosophy is definitely recommended for all developers of military electronic equipment.


PILOT PRODUCTION 


Pilot production follows development of the 
equipment, and its primary purpose is to enable 
the customer to get an idea of what may be avail- 
able from regular production. It is also of bene- 
fit to the manufacturer in that it permits him to 
prove out the tooling and manufacturing processes. 

In addition to the usual performance tests, a battery of environmental tests should be carried out to determine the reliability. Among these should be temperature and input voltage variation, vibration and off-on cycling as a minimum. The other environments selected depend on the corresponding service conditions and may include humidity, salt spray, sand, dust, shock, radiation, etc.

Because of the inherent characteristics of the 
pilot production process, the output is unavoidably 
heterogeneous and the reliability tests are indica- 
tive of the capability of the manufacturing process 
rather than of acceptability. 


MANUFACTURE


We will now consider the area of manufacture, 
or full production. The most obvious method of 
assuring that unreliabilities do not creep in during 
the manufacturing process is by practicing ade- 
quate quality control. 

Another step, which can be just as important, is to survey and rate the vendors in the field and to qualify their products.

Automation can be of help in improving reliability. Investigations indicate that mechanized
assembly techniques for electronic equipment 
tend to maximize reliability. These techniques 
include processed wiring circuitry, mechanized 
insertion of parts, automatic mass soldering and 
automatic functional testing. Mechanized produc- 
tion and testing methods possess an advantage over 
manual methods in that the former avoid the ir- 
regularities in techniques and materials of the 
latter, resulting in improved reliability. 

One of the reasons for the existence of unre- 
liable equipment is the tendency to rush it into 
production before the development has really been 
completed. Present procurement practices, 
which aim to provide accelerated delivery of 
electronic equipment, tend to minimize the time 
allowed for adequate reliability evaluation. This 
telescoping of development with procurement is 
accomplished at the expense of a sound reliability 
test program during the vital engineering phase 
and must necessarily be reflected in decreased 
reliability of the end product. Therefore, it is 
recommended that production be postponed until 
adequate engineering tests prove that the item in 
question fully meets the reliability requirements. 


TRANSPORTATION 


The transportation phase furnishes an excellent 
opportunity for introduction of unreliabilities. 

The military services have experienced substan- 
tial damage to equipment during shipping, result- 
ing from improper packaging and packing. Since 
the damage that occurs is not always detectable, and therefore not always repaired, incipient failures may easily result. Proper packaging and packing are an important link in the reliability chain.

The steps taken to ensure proper packaging design of the equipment to withstand operating shock
and vibration will also serve to protect it during 
transportation. In addition, it is necessary to 
investigate the shock and vibration experienced 
by equipment packed and shipped in containers. 

It is recommended that instruments be developed 
which will record the amplitudes and durations of 
shocks to which equipments are subjected in
shipment. These should be of a type which will 
operate unattended for a period of several weeks. 
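The core of such a recorder is simple bookkeeping. Wherever the measured acceleration exceeds a threshold, it notes when the excursion began, how long it lasted and its peak value. The sketch below assumes a digitized acceleration record in g at a fixed sample rate. The function name, the threshold and the data are hypothetical, and a field instrument of the kind recommended would also need its own sensing and storage hardware.

```python
from typing import List, Optional, Sequence, Tuple


def shock_events(samples_g: Sequence[float], sample_rate_hz: float,
                 threshold_g: float) -> List[Tuple[float, float, float]]:
    """Return (start_s, duration_s, peak_g) for each run of samples whose
    magnitude exceeds threshold_g."""
    events: List[Tuple[float, float, float]] = []
    start: Optional[int] = None
    peak = 0.0
    for i, a in enumerate(samples_g):
        if abs(a) > threshold_g:
            if start is None:          # a new shock begins
                start, peak = i, 0.0
            peak = max(peak, abs(a))
        elif start is not None:        # the shock has just ended
            events.append((start / sample_rate_hz, (i - start) / sample_rate_hz, peak))
            start = None
    if start is not None:              # the record ended during a shock
        events.append((start / sample_rate_hz,
                       (len(samples_g) - start) / sample_rate_hz, peak))
    return events


# Hypothetical record sampled at 1 kHz: two shocks, one of them negative-going
record = [0.1, 0.2, 6.0, 12.5, 8.0, 0.3, 0.1, -7.0, -9.5, 0.2]
for start_s, dur_s, peak_g in shock_events(record, 1000.0, threshold_g=5.0):
    print(f"t = {start_s * 1000:.0f} ms, duration {dur_s * 1000:.0f} ms, peak {peak_g} g")
```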

In addition, it will be necessary to determine 
specific dynamic values for a wide variety of
cushioning materials for the use of designers of 
shipping containers. 


The recommended research in packaging and 
packing should lead to increased operational re- 
liability of electronic equipment. 


STORAGE 


Since production contracts provide for sufficient 
numbers of equipments not only to meet the current 
operational requirements but also to allow an 
adequate reserve, it is obvious that the excess 
must be kept in storage for appreciable periods of 
time. This process subjects these items to the 
deleterious effects of corrosion, chemical action 
and other forms of deterioration, thus posing an 
additional reliability problem. 

Equipment should be stored under conditions 
which minimize rate of deterioration. This sounds 
very simple, but in fact these conditions can be established only by 1) accelerated aging
tests, and 2) monitoring of items in storage. Ac- 
celerated aging tests are necessary to obtain data 
in a relatively short time during equipment devel- 
opment. However, since the conditions encountered 
during storage cannot be perfectly simulated, they 
are not a completely satisfactory substitute for 
storage monitoring. 

In order to provide data on deterioration in a 
form which is readily usable by: 1) the agency 
directly concerned with the given item, and 2) 
agencies which require such data as background 
information for similar items or subdivisions 
thereof, it is essential that such data be made 
available in convenient form. The most suitable 
forms are considered to be punch cards or mag- 
netic tape. At present, huge masses of informa- 
tion are buried in miscellaneous and heterogeneous 
reports in the archives of multitudes of agencies. 
As the number of equipments in existence in- 
creases, this situation will become greatly aggra- 
vated unless a streamlined system of data report- 
ing and reduction is adopted. It is recommended 
that a working group be established at Department 
of Defense level to develop such a system of data 
handling which will be uniformly employed by the 
various services and will be designed to be com- 
patible with the requirements of all the agencies. 
However, if such a plan is to succeed, it must be 
implemented by directive at DOD level which will 
make use of the adopted system mandatory. 

In order to provide maximum benefits from 
such a system, it would be advisable to retain a 
life history of each individual equipment from the 
time it is manufactured until it is removed from 
service by the operating unit. Only in this man- 
ner can a comprehensive knowledge of the varia- 
tion in condition be obtained. The information 
that is obtained by the reporting sources will be 
transcribed to forms suitable for handling by 
computing machines and will be subjected to 
statistical and engineering analyses. Results ob- 
tained will be in a form that can be used directly 
by designers, manufacturers, storage personnel 
or operating agencies. The Ballistic Research 
Laboratories are presently working on such a
system for use with military electronic equipment. 
Collaboration with groups working on the same 
problem is invited. 
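A per-equipment life history of the kind proposed here maps naturally onto a simple record structure. The sketch below uses a hypothetical layout, not the Ballistic Research Laboratories system, whose format is not described in this paper. Each entry notes the phase (acceptance, storage or operation), the operating hours and whether a failure occurred. The records of many equipments are then pooled into an observed mean time between failures.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class HistoryEntry:
    date: str               # e.g. "1960-04-01"
    phase: str              # "acceptance", "storage", "operation", ...
    operating_hours: float  # hours of operation since the previous entry
    failed: bool = False    # True if this entry reports a failure
    note: str = ""


@dataclass
class EquipmentHistory:
    serial_number: str
    entries: List[HistoryEntry] = field(default_factory=list)

    def record(self, entry: HistoryEntry) -> None:
        self.entries.append(entry)

    def operating_hours(self) -> float:
        return sum(e.operating_hours for e in self.entries)

    def failures(self) -> int:
        return sum(1 for e in self.entries if e.failed)


def pooled_mtbf(histories: Iterable[EquipmentHistory]) -> Optional[float]:
    """Total operating hours divided by total failures across many equipments."""
    histories = list(histories)
    hours = sum(h.operating_hours() for h in histories)
    failures = sum(h.failures() for h in histories)
    return hours / failures if failures else None


# Hypothetical records for two serial numbers of the same equipment type
a = EquipmentHistory("SN-0001")
a.record(HistoryEntry("1959-01-10", "acceptance", 50.0))
a.record(HistoryEntry("1959-02-01", "storage", 0.0, note="depot"))
a.record(HistoryEntry("1959-09-15", "operation", 400.0, failed=True, note="tube replaced"))
b = EquipmentHistory("SN-0002")
b.record(HistoryEntry("1959-01-12", "acceptance", 50.0))
b.record(HistoryEntry("1959-11-03", "operation", 700.0, failed=True))
print(pooled_mtbf([a, b]))
```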

In order that the data obtained be valid, it is 
essential that the equipment used to test these 
items be of sufficient precision. Although the spe- 
cifications for equipment in storage and operation 
are generally less stringent than those for accept- 
ance, this should not imply that a corresponding 
decrease in precision of test sets used at these 
stages is permissible; rather, all test sets used 
for any given equipment should be of similar ac- 
curacy, regardless of whether employed in the ac- 
ceptance, storage or operational phases. Only in 
this manner can trustworthy and comparable data
be obtained. 

Accuracy of measuring equipment presumes 
calibration against precise standards. In the in- 
terests of obtaining uniform results, it might be 
desirable to appoint a panel at DOD level to pre- 
scribe calibrating equipment to be employed. In 
fact, it may even be advisable to institute a Mili- 
tary Bureau of Standards, similar in function to 
the Calibration Division of the National Bureau of 
Standards but slanted towards calibration of the 
type of test equipment employed by the Armed 
Forces. 

Another prerequisite for assuring validity of 
data is employment of high caliber technicians in 
the organizations performing the reporting func- 
tion. This is contingent upon acceptance of the 
recommendations included in the Cordiner report 
dealing with the shortage of trained technicians in 
the Armed Forces. More will be said about this 
problem in the portion of the paper devoted to 
maintenance. 

Another cause of insufficient reliability is the 
fact that the design of suitable test equipment is 
usually treated as a secondary consideration. The 
testers are often not available until it is much too 
late to be of use in assuring reliability of the item. 
It is necessary that design of the basic equipment 
and its testers be treated integrally. 


OPERATION AND MAINTENANCE 


Maintenance is an important factor in the effort 
to achieve reliability, its purpose being to sustain 
designed performance and continued operation of 
equipment and systems in order to attain the 
highest degree of operational readiness. 

Maintenance of electronic equipment is depend- 
ent upon such factors as equipment maintainability, 
personnel training, preventive maintenance pro- 
cedures and quality of support material such as 
technical manuals, test equipment and test 
facilities. 

Maintainability has already been defined in 
this paper as the reciprocal of the mean net time 
to repair failures. Unfortunately, there is nothing 
that maintenance personnel can do about this 
characteristic, since it is predetermined. If the 
design people have been careful to observe the 
tenets of the disposal-at-failure maintenance 
philosophy such as modular construction, encapsu- 
lation, etc., then the maintainability of the equip- 
ment should be high. 

Even if the design of the equipment enables 
relatively unskilled personnel to perform the 
maintenance function at the lower echelons, highly 
skilled technicians are still needed at the top 
echelons. Unwise policies have permitted the situation to deteriorate to the point where large numbers of extremely expensive equipments are at the mercy of fewer and less skilled personnel than ever before. This grave situation can be alleviated only by taking immediate and drastic steps. The most effective one would be to decrease the high turnover rate of trained men by offering sufficient incentive to remain in the services. This could be accomplished by raising the pay scales to realistic levels and by reinstituting the many fringe benefits which once were enjoyed. To ensure that ability rather than longevity is the basis for promotion, a merit system should be adopted. Another very
effective means of maximizing available skilled 
manpower is the elimination of the practice of 
requiring the technician to perform nontechnical 
routine duties, such as K.P., guard duty, etc. A 
less direct but nevertheless important factor is the low level of technical background possessed by the average recruit, necessitating inordinately long training periods spent acquiring basic knowledge which should have been obtained previously. It is
therefore advantageous to the Department of De- 
fense to seek the adoption of better and more 
thorough training in mathematics and the physical 
sciences at the secondary school level. 

Reliability can be greatly increased by de- 
tecting potential failures before they have an 
opportunity to occur. One of these maintenance 
techniques is called marginal checking and is 
related to the marginal checking performed during 
design. 

The principle underlying marginal checking of 
electronic equipment as a preventive maintenance 
procedure is as follows. If all the parts are in 
good condition, then variation of parameters, gen- 
erally power supply or signal voltages, will not 
cause the equipment to fail. However, failure may 
be induced if a part has deteriorated; e.g., if the 
transconductance has been appreciably reduced. 
The method employed is to vary voltages between 
specified limits and observe whether the equip- 
ment functions properly. For instance, in the 
checking of a computer, a problem may be fed to 
it while varying the voltage on portions of the 
computer in turn. An incorrect answer serves to 
localize the malfunctioning circuit and then the
potentially defective part. 
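The computer example can be expressed as a simple loop over sections. The sketch below is illustrative only. `toy_computer` is a hypothetical model of a machine whose adder contains a weakened tube, and the 225 V and 275 V limits (a nominal 250 V ± 10 per cent) are invented. A real check would set the marginal-checking supply on each section in turn and compare the machine's answer with the known answer to a test problem.

```python
from typing import Callable, Iterable, List

# run_problem(section, supply_v) -> the answer the computer gives with that
# section's supply set to supply_v (all other sections at nominal)
ProblemRunner = Callable[[str, float], int]


def marginal_check(sections: Iterable[str], run_problem: ProblemRunner,
                   expected: int, low_v: float, high_v: float) -> List[str]:
    """Vary each section's supply to its low and high limits in turn and return
    the sections that give a wrong answer at either limit."""
    suspects = []
    for section in sections:
        for v in (low_v, high_v):
            if run_problem(section, v) != expected:
                suspects.append(section)
                break  # one wrong answer is enough to flag the section
    return suspects


# Hypothetical computer: a weakened tube in the "adder" section has used up its
# margin, so the adder miscomputes once its supply is lowered below 240 V.
def toy_computer(section: str, supply_v: float) -> int:
    if section == "adder" and supply_v < 240.0:
        return 41
    return 42  # correct answer to the test problem


print(marginal_check(["memory", "adder", "control"], toy_computer,
                     expected=42, low_v=225.0, high_v=275.0))
```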

Another technique which is being investigated is 
the prediction of imminent failure based on the 
variation of parameters of certain electronic parts. Armour Research Foundation, under Air Force
contract, is conducting studies which indicate that 
resistor noise progressively increases prior to 
failure. It also appears that a decrease in insula- 
tion resistance of both resistors and inductors 
may be a harbinger of failure. Thus, monitoring 
of the equipment may provide a means of prevent- 
ing failures by furnishing sufficient warning to 
permit part replacement. It is suggested that re- 
search along these lines be expanded, since appli- 
cation of these principles will be of great assist- 
ance in improving reliability. 
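Monitoring of this kind amounts to comparing each part's recent readings with its early ones. The studies above are described only qualitatively, so the warning thresholds in this sketch are hypothetical. It flags a part when the noise has at least doubled or the insulation resistance has fallen to half its first value. The part names and readings are also invented.

```python
from typing import Dict, List, Optional, Tuple

Readings = Optional[List[float]]  # oldest reading first; None if not measured


def noise_rising(noise: Readings, factor: float = 2.0) -> bool:
    """True if the latest noise reading is at least `factor` times the first."""
    return bool(noise) and len(noise) >= 2 and noise[-1] >= factor * noise[0]


def insulation_falling(ir: Readings, fraction: float = 0.5) -> bool:
    """True if insulation resistance has fallen to `fraction` of its first value."""
    return bool(ir) and len(ir) >= 2 and ir[-1] <= fraction * ir[0]


def parts_to_replace(history: Dict[str, Tuple[Readings, Readings]]) -> List[str]:
    """Names of parts whose readings give warning of approaching failure."""
    return sorted(name for name, (noise, ir) in history.items()
                  if noise_rising(noise) or insulation_falling(ir))


# Hypothetical readings: noise in microvolts, insulation resistance in megohms
history = {
    "R12": ([0.5, 0.6, 1.2], [1000.0, 990.0, 985.0]),
    "R7": ([0.4, 0.45, 0.5], [1000.0, 980.0, 990.0]),
    "L3": (None, [800.0, 500.0, 350.0]),
}
print(parts_to_replace(history))
```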

Maintenance of electronic equipment requires 
the use of precision test equipment for a variety 
of measurements. Calibration of testers must be 
dependable regardless of time or location. This 
requirement imposes a need for uniform calibra- 
tion standards throughout all military installa- 
tions. It is recommended that DOD set up cali- 
bration centers in selected areas. Utilizing the 
National Bureau of Standards for primary re- 
ference, these centers would service all military 
agencies within their respective areas. 

Support material such as test equipment, train- 
ing, and instruction manuals are essential for the 
proper performance of the maintenance function. 
Yet, more often than not, these are not available 
simultaneously with the main equipment. Opera- 
tion of the latter without the guidance furnished 
by the applicable technical manuals and use of the 
proper test equipment is not conducive to achieve- 
ment of maximum reliability. Therefore, it is 
recommended that no equipment be released for 
distribution unless accompanied by the applicable 
support material. 

The scope of this field is tremendous, so that 
only the highlights have been touched upon in this 
paper. However, if the recommendations which 
have been made were universally adopted, this 
author believes that the reliability of our military 
electronic equipment would be greatly increased. 
In fact, if this accomplishment were made known 
to the enemy, it might even serve as an effective 
deterrent to military conflict. 


REFERENCES 


[1] Advisory Group on Reliability of Electronic Equipment (AGREE), Office of Assistant Secretary of Defense (OASD), “Reliability of Military Electronic Equipment,” Rept.; June 4, 1957.

[2] R. Lusser, “The Notorious Unreliability of Complex Equipment,” Redstone Arsenal, Huntsville, Ala., Rept.; September, 1956.

[3] R. Lusser, “Reliability of Guided Missiles,” Redstone Arsenal, Huntsville, Ala., Rept.; September, 1954.

[4] S. G. Bassler, “Principles of electronic circuit packaging,” Electrical Manufacturing; August, 1955.

[5] S. G. Bassler, “Electronic circuit packaging for missile applications,” Electrical Manufacturing; March, 1958.

[6] R. Lusser, “Which road to reliability?” Electronic Equipment; January, 1957.

[7] R. Lusser, “A Study of Methods for Achieving Reliability of Guided Missiles,” U. S. Naval Air Missile Test Center, Point Mugu, Calif., Tech. Rept. No. 75; July 10, 1950.

[8] R. Lusser, “General Specifications for the Safety Margins Required for Guided Missile Components,” U. S. Naval Air Missile Test Center, Point Mugu, Calif., Tech. Rept. No. 84; July 10, 1951.

[9] R. Lusser, “Planning and Conducting Reliability Test Programs for Guided Missiles,” U. S. Naval Air Missile Test Center, Point Mugu, Calif., Tech. Rept. No. 70; June 20, 1952.

[10] M. A. Acheson, “The unreliable universal component,” Electronic Equipment; January, 1957.

[11] K. A. Pullen, “A new approach to conservative design,” Electronic Equipment; May, 1957.

[12] C. J. Savant and H. S. Hansen, “Reliability as a responsibility of engineering management,” IRE TRANS. ON RELIABILITY AND QUALITY CONTROL, No. PGRQC-9, pp. 45-48; January, 1957.

[13] R. C. Marder, “The effect of mechanized production techniques upon reliability,” Military Automation; January-February, 1958.

[14] Department of Defense Instruction Number 3232.2, “Electronic Equipment Failure Data Reporting System and DD Forms 787 and 787-1,” Rept.; February 23, 1956.




WHAT PRICE UNRELIABILITY 


DANA A. GRIFFIN†


PART I


The continuing record of mission failures of our 
complex ICBM and IRBM missiles is highlighted by 
every communication medium in the world. Simi- 
lar mission failures of other types of complex sys- 
tems are not publicized. We can be sure that they 
are extremely costly because there are many more 
of them. 

The excessive costs of such unreliability might 
be written off as a necessary means to the end, if 
this were the only factor to be considered. Unfor- 
tunately, this is not the case. Mission success in 
the ICBM-IRBM missile and the antimissile field 
depends upon the ability to act instantaneously 
without failure. 

The purpose of this article is to demonstrate 
the need for a realistic look at the mission suc- 
cess potential of our complex weapons systems 
and the way to obtain a large increase in mission 
success potential at much lower costs. 

Current military procurement practice is to 
establish a reliability requirement on every weap- 
on system and then assume that the prime con- 
tractors will attain the desired level of perfor- 
mance using presently available parts. This 
policy establishes a defense posture based on 
wishful thinking rather than reality insofar as our 
complex systems are concerned. The basic 
building blocks of all systems, the small compo- 
nent parts, do not possess the life expectancy to 
meet the system reliability requirements. 

The moment we express permissible system 
failures in terms of specific missions, we must 
automatically consider the life expectancy of 
every part of the system that is likely to fail and 
by so doing, cause a mission failure. 

Computations on many systems now in develop- 
ment or production clearly indicate that the relia- 
bility requirements cannot be met with a high de- 
gree of confidence. The failure rate of the major- 
ity of the electronic component parts presently 
available for use in these systems is from 10 to 
100 times poorer than the desired system failure 
rate will allow. 

Let us distress the statisticians with three 
oversimplified expressions of an elementary for- 
mula: 


†The Daven Company, Livingston, N. J.


$$\frac{A}{B} = C, \qquad A = B \times C, \qquad \text{or} \qquad \frac{A}{C} = B$$


If A equals the permissible mean time between 
failure rate of the system in hours, and B equals 
the number of parts whose individual failures can 
cause a mission failure, C can be expressed as 
the permissible number of part failures in a given 
number of hours. 

If we know the value of A and the number of 
parts in the system B, we can determine the 
value of C. Similarly, if we know the number of 
parts involved, B, and their failure rate, C, we 
can determine A, the system rate. If we know 
A and C, we can determine the maximum number 
of components we can employ in a system to ob- 
tain the requisite system failure rate. 

For the past two years, contracts have been let 
for complex weapon systems with values assigned 
to A that are required to assure mission suc- 
cess, without regard to the part count, B, or 
component part failure rates, C. 

With a simpler mathematical table (Table I) we 
can get to the nub of the situation. Component 
part failure rates are expressed in terms of their 
Acceptable Reliability Levels (ARL) which permit 
a certain number of failures per thousand hours of 
operation in a given quantity of parts. These are 
expressed in ARL percentages as indicated below. 


TABLE I

ARL Percentage | Number of Failures per 1000 Hours per 100,000 Parts


If a missile system employs 100,000 parts and the permissible MTBF rate for the system is 100 hours, we need parts with a failure rate of 0.01 per cent to obtain the specified system failure rate. Electronic components with these failure rates may be available in the Soviet Union, but they are conspicuous by their absence in the United States.
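The arithmetic behind this example can be checked with a short sketch. It assumes, as the heading of Table I suggests, that an ARL percentage is the percent of parts expected to fail per 1000 hours, and that any single part failure aborts the mission; the helper names are hypothetical.

```python
def arl_to_hourly_rate(arl_percent: float) -> float:
    """Per-part failure rate in failures/hour for an ARL given in percent
    per 1000 hours (0.01 per cent -> 1e-4 per 1000 h -> 1e-7 per hour)."""
    return arl_percent / 100.0 / 1000.0


def failures_per_1000_hours(arl_percent: float, parts: int) -> float:
    """Expected part failures per 1000 hours in a population of parts."""
    return arl_percent / 100.0 * parts


def system_mtbf_hours(arl_percent: float, parts: int) -> float:
    """System MTBF when every one of `parts` parts can abort the mission."""
    return 1.0 / (parts * arl_to_hourly_rate(arl_percent))


for arl in (1.0, 0.1, 0.01, 0.001):
    print(f"ARL {arl}%: {failures_per_1000_hours(arl, 100_000):g} failures "
          f"per 1000 h per 100,000 parts, system MTBF "
          f"{system_mtbf_hours(arl, 100_000):.4g} h")
```

At 0.01 per cent the 100,000-part system comes out at a 100-hour MTBF, matching the text; at 0.1 per cent it falls to 10 hours.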

The basic reason for this unfortunate situation 


960 


is the failure of our military procurement personnel to recognize the nature of our free enterprise system. There are two independent tiers of manufacturers, broadly speaking. The first tier produces systems and the second tier produces many of the critical component parts that make up these systems.

Billions of dollars have been allocated to first-tier contractors to produce systems using presently available, relatively unreliable component parts. A companion program of financial aid to second-tier component part manufacturers for component part improvement to the levels required by first-tier contractors does not exist.

The failure to provide for a component part improvement program has created an unprecedented situation. The manufacturers of a number of complex systems require component parts that meet a 0.01 per cent ARL. They cannot buy such parts at any price from component part manufacturers at the present time.


Bankruptcy via Redundancy 


At first glance, it might be assumed that enough unreliable missiles can be purchased to offset their lack of adequate MTBF rates. The hopelessness of this approach can be illustrated by a purely hypothetical example.

An effective antimissile-missile shield for our continent might require 1000 missiles ready to fire. Their complexity might demand the use of component parts with an ARL of 0.001 per cent in order to provide the necessary mission success potential.

In order to approach this goal with missiles using parts with a 0.1 per cent ARL, 100,000 missiles will be needed! Ignoring the cost of the extra ground installations and the full-sized army to repair and operate the unreliable missiles, an approximation of production costs may prove of interest, as it points the way to the justification of a large-scale program for component part improvement.

As an example, we might buy 100,000 missiles (ARL 0.10 per cent) at $1,000,000 each for $100,000,000,000. Or, by arbitrarily increasing the unit cost ten times to ensure reliability, we might buy 1000 missiles (ARL 0.001 per cent) at $10,000,000 each for $10,000,000,000. We must spend an additional 90 billion dollars for enough unreliable missiles to approach the same mission success potential, and even this comparison ignores the time wasted in firing 99 abortive shots out of 100, which will increase the probability of enemy breakthrough by a substantial amount.
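The cost comparison follows from two multiplications. The sketch below restates it, assuming (as the hypothetical example does) that the number of missiles needed scales with the ratio of available to required ARL; the function and variable names are illustrative only.

```python
def missiles_needed(base_count: int, arl_required: float, arl_available: float) -> int:
    """Missiles needed if the count scales with the ARL shortfall (the
    example's simplifying assumption)."""
    return round(base_count * arl_available / arl_required)


reliable_count = 1_000                                    # ARL 0.001 per cent
unreliable_count = missiles_needed(reliable_count, 0.001, 0.1)

unreliable_cost = unreliable_count * 1_000_000            # $1,000,000 each
reliable_cost = reliable_count * 10_000_000               # $10,000,000 each

print(unreliable_count)                  # missiles at ARL 0.1 per cent
print(unreliable_cost - reliable_cost)   # extra dollars spent
```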

There is no reason to assume that it will cost


GRIFFIN: WHAT PRICE UNRELIABILITY Td 


9 billion dollars to raise the ARL from 0.1 per 
cent to 0.001 per cent. However, it will require 
substantial expenditures at the second-tier level 
of component parts manufacturers for development 
and facilities. This is by no means the only justi- 
fication for a major attack on the problem of im- 
proving component part life expectancy to levels 
undreamed of a few years ago. 


Maintenance Costs 


Major improvements in component part life ex- 
pectancy can reduce our annual maintenance and 
repair bills by many billions of dollars. 

The high cost of replacing short-life component 
parts with similar short-life parts can be demon- 
strated by elementary multiplication and the fail- 
ure rates listed in Table II. 

As we produce systems requiring billions of 
parts every year, plus billions more for replace- 
ment purposes, replacement costs per billion parts 
will be estimated for the various ARL percentages. 


TABLE II

ARL Percentage | No. of Failures per 1000 Hours per Billion Parts | Unit Replacement Cost | Cost per 1000 Hours | Cost for 50,000 Hours (System Life)
1     | 10,000,000 | $20.00 | $200,000,000 | $10,000,000,000
0.1   | 1,000,000  | 20.00  | 20,000,000   | 1,000,000,000
0.01  | 100,000    | 20.00  | 2,000,000    | 100,000,000
0.001 | 10,000     | 20.00  | 200,000      | 10,000,000


If we assume a current failure rate at the 0.10 per cent level, a change to the 0.01 per cent level will save $900,000,000 in repair costs per billion parts over the 50,000-hour system life of Table II (roughly a 5-year period). We have reason to believe that the unit replacement cost will be challenged; if all cost factors are considered, it should probably be increased by a substantial amount. The dollar-wise accuracy of the calculations is not particularly important, however. We are only considering 1 billion parts in a military hardware system that employs and stocks many billions of parts.
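Table II and the $900,000,000 figure can be reproduced with exact rational arithmetic. This sketch assumes, as the table does, a flat $20.00 replacement cost per failure and a 50,000-hour system life; `table_ii_row` is a hypothetical helper, and ARL values are passed as strings so that `Fraction` keeps them exact.

```python
from fractions import Fraction


def table_ii_row(arl_percent: str, parts: int = 1_000_000_000,
                 unit_cost: int = 20, life_hours: int = 50_000):
    """Return (failures per 1000 h, cost per 1000 h, cost over the life)."""
    fraction_failing = Fraction(arl_percent) / 100     # per 1000 hours
    failures = fraction_failing * parts
    cost_per_1000h = failures * unit_cost
    life_cost = cost_per_1000h * life_hours / 1000
    return int(failures), int(cost_per_1000h), int(life_cost)


for arl in ("1", "0.1", "0.01", "0.001"):
    failures, per_1000h, life = table_ii_row(arl)
    print(f"{arl:>6}% {failures:>12,} ${per_1000h:>13,} ${life:>16,}")

savings = table_ii_row("0.1")[2] - table_ii_row("0.01")[2]
print(f"Savings, 0.1% -> 0.01%: ${savings:,}")
```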

The important factor is the huge reduction in 
repair costs that will obtain by decreases in fail- 
ure rates anywhere in the range between the 1 per 
cent ARL and the 0.01 per cent ARL. 

Realistic figures on weapon system mission 
success probability can only be obtained with a 
knowledge of the ARL percentages on all categor- 
ies of parts going into the systems, plus a count 
on the number of each category of part. 

A continuous survey of ARL percentages cur- 
rently available in all part categories will give the 
Defense Department a powerful measuring tool 
that can be used to determine realistic system re- 


72 IRE TRANSACTIONS ON RELIABILITY AND CONTROL 


liability levels and also serve as the basis for a 
large-scale program for the improvement of 
component part life expectancy that is so sorely 
needed. 


PART II


In Part I of this paper, the need for a major 
improvement in the life expectancy of the parts 
going into our weapons systems was demonstra- 
ted, and the justification for major expenditures 
in this field was developed in two areas. Many 
billions of dollars can be saved by a reduction in 
the number of systems required to insure mission 
success, and system repair costs can be dimin- 
ished by a substantial increase in component part 
life expectancy. 

In Part I it was suggested that our weapons systems reliability be reappraised from the bottom up, that is, by using the life expectancy levels of presently available component parts in system reliability predictions. The difference between the desired system mean time between failure rates, specified by our tactical experts in our weapon system contracts, and the realistic rates which such computations will disclose should give some cause for alarm to those charged with the responsibility of protecting our major cities from total obliteration and winning a major war should the occasion arise.

Unfortunately, we cannot buy time, so in order 
to dispel possible complacency on the part of our 
tactical commands, it is necessary to point out 
that the immediate implementation of a component 
part improvement program cannot possibly result 
in improved systems for 3 to 5 years. 

The reasons for this time delay will be dis- 
cussed later. It is important that the tactical com- 
mands realize that they are being forced to use 
complex systems now in production or in the de- 
velopment stage which will not provide the speci- 
fied mission success potential and that no substan- 
tial improvement can be expected for some time to 
come! 

We have the “know-how” and the facilities to build complex systems, but we don’t have the “know-how” or the facilities to build the reliable parts that these complex systems require in order to function in the specified manner.

There are many reasons for this deplorable 
state of unpreparedness. The major factors are: 


1) The state of art in component part develop- 
ment. 

2) Failure to recognize the scope of the prob- 
lem. 

3) The lack of adequate funds and of willingness 


APRIL 


to spend them at the proper level in our two- 
tier industry. 


It would be easy to write a full-length volume 
on the history of electronic component part devel- 
opment for the past forty years. In brief, there 
have been two motivating factors: the needs of the 
home radio-television industry and the military 
services. The volume production demands of the 
former are responsible for the capital investment 
for production facilities and much of today’s 
“know-how.” Military expenditures have been
almost exclusively confined to the areas of basic 
research and the development of parts that would 
work under field conditions. 

These combined efforts have enabled us to 
reach a plateau where the ARL’s of most parts 
range between 0.5 per cent and 0.1 per cent. This 
is more than adequate for the needs of the radio 
and television industries. Further improvement in 
component part life expectancy must be wholly 
financed by the military services. 

A shift from these ARL’s to a new plateau of a 
0.01 per cent ARL for all categories of component 
parts is an innocent-looking expression. Actually, 
its attainment will require major technological 
breakthroughs in physics, chemistry, product de- 
sign, and process control, to name a few areas 
where current knowledge and techniques are 
inadequate. 

Today, the services are buying electronic com- 
plexity in varying degrees in modern weapon sys- 
tems; production can no longer be evaluated in 
terms of dollars per pound of product. These pro- 
duction yardsticks of World War II are completely 
inadequate. 

The importance of increased component part 
life expectancy in complex systems can be illus- 
trated with Table III which is completely hypo- 
thetical. 


TABLE III

Name | Number of Parts (Approximately) | ARL Percentage Required | ARL Percentage Available | Mission Success Potential
Titan   | 100,000 | 0.01 | 0.1 | Very Poor
Atlas   | 50,000  | 0.02 | 0.1 | Poor
Thor    | 10,000  | 0.1  | 0.1 | Good
Jupiter | 15,000  | 0.05 | 0.1 | Fair
Polaris | 75,000  | 0.01 | 0.1 | Very Poor


The substitution of real numbers in a table of 
this type on all of our complex weapons systems 
would be enlightening, to say the least. The mis- 
sion success potential is evaluated in terms of the 
disparity between an optimistic value of a 0.1 per 
cent ARL for available component parts and the 




component part ARL required to meet the speci- 
fied MTBF rate for the system in question. 


Component Part Survey 


As suggested in Part I, the first sensible cor- 
rective steps will be to survey available ARL’s for 
every category of part and obtain a count on the 
number of parts of each category going into each 
system. From these data, grand totals on each 
category of parts required by the services can be 
obtained and the scope of the problem can be 
established. 


Facilities Funding 


From this survey we may find, for example, 
that 50,000,000 transistors with an ARL of 0.01 per 
cent are needed between 1961 and 1962. Since fully 
automatic production facilities are required to 
meet the ARL of 0.01 per cent, funds for this pur- 
pose must be provided just as machine tools and 
billions of dollars’ worth of other facilities are 
supplied to first-tier prime contractors. 

The question of how many contractors and 
which of the many potential contractors will be 
given facilities must be resolved, plus hundreds of 
other questions in many areas. An effective, well- 
coordinated program covering all categories of 
component parts will require an efficient adminis- 
trative organization within the Defense Depart- 
ment at the decision-making level, which presently 
does not exist. 

There are a number of other factors that affect 
weapons systems performance adversely. Errors 
in design, factory malpractice, inadequate inspec- 
tion, improper installation, and poor maintenance 


BARLOW AND HUNTER: DETERMINING OPTIMUM REDUNDANCY 




play their part in the reduction of mean time be- 
tween failure rates. Any of these faults can be 
corrected in a relatively short space of time. 

This is not the case insofar as major improve- 
ment in component part life expectancy is con- 
cerned. Basic research, part development and the 
design and fabrication of the requisite production 
facilities will take large amounts of time even 
though a so-called crash program is instituted to 
accomplish the desired results. 


CONCLUSIONS 


1) Our defense posture is impaired by the de- 
gradation of weapon system mean time be- 
tween failure rates below the desired levels. 
This is occasioned by the lack of component 
parts with adequate life expectancy. 

2) We are unprepared to produce component 
parts with the life expectancy required. 

3) It will take from 3 to 5 years to obtain a 
major improvement in component part life 
expectancy after the initiation of a large- 
scale program in this area. 

4) A survey of available component part ac- 
ceptable reliability levels can be used to 
determine realistic weapon system failure 
rates and the scope of the task of improving
component part life expectancy to the levels 
required by our complex weapons systems. 

5) Corrective action by the Department of 
Defense is essential. It will enhance our de- 
fense posture and save billions of dollars 
annually in production and maintenance costs. 


CRITERIA FOR DETERMINING OPTIMUM REDUNDANCY* 


R. E. BARLOW and L. C. HUNTER


Summary—Redundant circuits whose compo- 
nents may suffer either an open-circuit or a 
short-circuit type of failure are considered. A 
probabilistic model for such circuits is pro- 
posed. Two criteria for determining optimum 


*This work was prepared for the U. S. Army Signal Re- 
search and Development Lab. under Signal Corps Contract 
No. DA 36-039 SC-78281. 

†Electronic Defense Lab., Sylvania Electric Products,
Inc., Mountain View, Calif. 


redundancy are studied. A formula for obtain- 
ing the number of components which maximize 
reliability is derived for general failure distri- 
butions. A table for obtaining the number of com- 
ponents which maximize the expected life of the 
circuit is presented for the case of exponential 
failure. 


INTRODUCTION 


By a redundant circuit, we shall mean any cir- 
cuit all of whose primary components perform the 




same function. For example, an arrangement of 
n switches or diodes could constitute a redundant 
circuit. We shall be mainly concerned, however, 
with parallel and series circuits and arrange- 
ments built up of such units. Continuing the dis- 
cussion from a previous report [1], we shall de- 
note the reliability of the circuit at time t by 
R(t). This is understood to be the probability that 
the circuit is operating at time t, given that it 
was put into operation at t= 0. We agree to 
count time only while the circuit is operating. 
Since we are not concerned with repair, the fail- 
ure distribution of the circuit is given by 


G(t) = 1- R(t). 


Knowing G, the expected life of the circuit can 
be computed. 

It is usually assumed that increased redun- 
dancy assures increased reliability. However, in 
practice, components in parallel or series often 
fail in such a way as to affect the entire circuit.
For example, given a network of diodes in paral- 
lel, one could short-circuit and render the whole 
network inoperative even though one or more of 
the remaining diodes were in operating condition. 

We shall be concerned with redundant circuits 
whose components can fail in either of the follow- 
ing two ways: 


1) An open-circuit failure can occur; 
2) A short-circuit failure can occur. 


Good examples of the first type of failure are: 
a diode may fail and not allow current to pass in 
either direction, or an open switch may fail to 
close when desired. Examples of the second type 
of failure are easy to find. A diode may fail and 
allow current to pass in both directions, or a 
closed switch may fail to open when desired. 

Lipp [3], Price [5], and others have considered 
probabilistic models which contain the features 
described above with the exception that failure 
distribution functions were not explicitly consid- 
ered. Our model also differs in other important 
respects. Moskowitz [4] has given an adequate 
analysis of redundancy networks but does not con- 
sider components which can fail in either of two 
ways. 

In order that our calculations be valid, we 
shall need to make the assumption that circuit 
components are independent in a probability 
sense. This precludes the application of our re- 
sults to many important circuits. Gilmore and 
Levi [2] have considered the problem of compo- 
nent independence in terms of adequate isolation. 
They indicate that such isolation is practical in a 
vacuum tube design but that transistors are not 
as easily isolated. They point out that redun- 




dancy techniques may even be detrimental rather 
than beneficial in some cases. 


MAXIMIZING RELIABILITY 


Let $F_i$ denote the failure distribution of the ith component in a redundant circuit, where $F_i$ does not distinguish between types of failure. Let $p_i$ denote the probability of an open-circuit failure in the ith component given that it has failed. We note that $p_i$ is a conditional probability. Similarly, we let $1 - p_i$ denote the probability of a short-circuit failure in the ith component given that it has failed. Let U denote a unit consisting of m components in parallel and let V denote a unit consisting of n components in series.

Suppose all components of U have identical failure distributions. Then it is easily seen that, for a parallel circuit,

$$R_m(t) = \sum_{k=0}^{m-1} \binom{m}{k} [1 - F(t)]^{m-k} [pF(t)]^k = [1 - F(t) + pF(t)]^m - [pF(t)]^m = x^m - y^m, \qquad (1)$$

where $x = 1 - F(t) + pF(t)$ and $y = pF(t)$.

Note that $x^m - y^m$ is non-negative and zero for m = 0 and m = ∞. Since the derivative of this expression with respect to m, set equal to zero, has a unique solution, it must be a maximum. Hence, the optimum integer m is close to the value

$$m^* = \frac{\log[(\log y)/(\log x)]}{\log(x/y)}.$$
In general, m = 1 is best whenever p < 1/2, as one would expect. To see this, note that p < 1/2 implies

$$F(t) > 2pF(t) \quad \text{or} \quad 1 > 1 - F(t) + 2pF(t) = x + y,$$

and hence $x^2 - y^2 = (x - y)(x + y) < x - y$, so that two components in parallel are less reliable than one. Since the expression $x^m - y^m$ is continuous in m and has a unique maximum when we allow m to assume all positive real values, the result follows.
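Equation (1) and the estimate m* can be compared with a brute-force search over integer m. A minimal sketch, assuming fixed values of p and F(t) with 0 < p < 1 and 0 < F(t) < 1 (so both logarithms are defined and x differs from y); the function names are hypothetical. The series case reuses the same function with 1 - p in place of p, as in (2).

```python
import math


def parallel_reliability(m: int, p: float, F: float) -> float:
    """Eq. (1): R = x**m - y**m with x = 1 - F + p*F and y = p*F."""
    x = 1.0 - F + p * F
    y = p * F
    return x ** m - y ** m


def series_reliability(n: int, p: float, F: float) -> float:
    """Eq. (2): the parallel formula with 1 - p substituted for p."""
    return parallel_reliability(n, 1.0 - p, F)


def m_star(p: float, F: float) -> float:
    """Real-valued stationary point of x**m - y**m."""
    x = 1.0 - F + p * F
    y = p * F
    return math.log(math.log(y) / math.log(x)) / math.log(x / y)


def best_count(reliability, p: float, F: float, limit: int = 200) -> int:
    """Integer count in 1..limit that maximizes the given reliability function."""
    return max(range(1, limit + 1), key=lambda k: reliability(k, p, F))


p, F = 0.9, 0.5
print(m_star(p, F), best_count(parallel_reliability, p, F))
```

For p = 0.9 and F(t) = 0.5 the continuous optimum is about 3.67 and the best integer is 4; for p below 1/2 the search returns m = 1, as argued above.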

Let all components of V have identical failure distributions. Then, substituting (1 - p) for p in (1), we have, for a series circuit,

$$R_n(t) = [1 - pF(t)]^n - [(1 - p)F(t)]^n = w^n - z^n, \qquad (2)$$

where $w = 1 - pF(t)$ and $z = (1 - p)F(t)$. The optimum n for fixed p and F(t) is the same as before when we substitute w for x and z for y. In general, n = 1 is best whenever p ≥ 1/2.




Proceeding in the same way, we can determine the reliability of more complicated networks. For convenience we make the following definitions.

Definition 1

A parallel-series (PS) arrangement shall denote m type V units in parallel.

Definition 2

A series-parallel (SP) arrangement shall denote n type U units in series.


Consider m type V units in parallel. We wish to determine the reliability of a parallel-series arrangement. To do this, let

a = probability a unit has not failed,
b = probability a unit has failed favorably,
c = probability a unit has failed unfavorably.

In this case, V fails unfavorably if all n components in V short-circuit. Again, let all components have identical failure distributions. Then

$$\text{(PS)}: \quad R_{PS}(t) = [a + b]^m - b^m.$$

Since a + b + c = 1, and

$$c = [(1 - p)F(t)]^n, \qquad a = [1 - F(t) + (1 - p)F(t)]^n - [(1 - p)F(t)]^n,$$

we have

$$\text{(PS)}: \quad R_{PS}(t) = \left[1 - \{(1 - p)F(t)\}^n\right]^m - \left[1 - \{1 - pF(t)\}^n\right]^m.$$

A similar argument shows that the reliability of a series-parallel arrangement is

$$\text{(SP)}: \quad R_{SP}(t) = \left[1 - \{pF(t)\}^m\right]^n - \left[1 - \{1 - (1 - p)F(t)\}^m\right]^n.$$

Both expressions can be optimized over m and n by setting the partials, with respect to m and n respectively, equal to zero and then solving. The optimum values will, of course, depend on F(t).

If p = 1/2, we can obtain optimum reliability with either a (PS) or (SP) arrangement. To see this, fix F(t), let $R_{PS}(t) = f(m, n)$ with $\max_{m,n} f(m, n) = f(m^\circ, n^\circ)$, and let $R_{SP}(t) = g(m, n)$ with $\max_{m,n} g(m, n) = g(m', n')$. Note that for p = 1/2, f(m, n) = g(n, m). Since

$$f(m^\circ, n^\circ) \ge f(n', m') = g(m', n') \quad \text{and} \quad g(m', n') \ge g(n^\circ, m^\circ) = f(m^\circ, n^\circ),$$

we have $f(m^\circ, n^\circ) = g(m', n')$. Hence, if the maximizing pairs are unique, $m^\circ = n'$ and $n^\circ = m'$. In this case we can obtain optimum reliability with either a (PS) or (SP) arrangement.
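The (PS) and (SP) formulas, and the p = 1/2 symmetry argument, can be checked numerically. This sketch assumes identical components and searches m and n up to an arbitrary limit of 15; the names are illustrative.

```python
def ps_reliability(m: int, n: int, p: float, F: float) -> float:
    """(PS): m series units of n components each, placed in parallel."""
    unit_shorted = ((1.0 - p) * F) ** n          # c: all n components short
    unit_open = 1.0 - (1.0 - p * F) ** n         # b: at least one component open
    return (1.0 - unit_shorted) ** m - unit_open ** m


def sp_reliability(m: int, n: int, p: float, F: float) -> float:
    """(SP): n parallel units of m components each, placed in series."""
    unit_open = (p * F) ** m                          # all m components open
    unit_shorted = 1.0 - (1.0 - (1.0 - p) * F) ** m   # at least one short
    return (1.0 - unit_open) ** n - unit_shorted ** n


def best_pair(reliability, p: float, F: float, limit: int = 15):
    """(m, n) in 1..limit maximizing the reliability, plus the maximum value."""
    pairs = [(m, n) for m in range(1, limit + 1) for n in range(1, limit + 1)]
    best = max(pairs, key=lambda mn: reliability(mn[0], mn[1], p, F))
    return best, reliability(best[0], best[1], p, F)


F = 0.3
(ps_m, ps_n), ps_max = best_pair(ps_reliability, 0.5, F)
(sp_m, sp_n), sp_max = best_pair(sp_reliability, 0.5, F)
print((ps_m, ps_n), ps_max, (sp_m, sp_n), sp_max)
```

With one component (m = n = 1) both arrangements reduce to 1 - F(t), and with n = 1 the (PS) formula reduces to (1), which gives two quick consistency checks.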




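Suppose now that the components do not have identical failure distributions, but that they still perform the same function. Then, reasoning as before, we obtain

$$R_{PS}(t) = \prod_{i=1}^{m}\left[1 - \prod_{j=1}^{n}(1 - p_{ij})F_{ij}(t)\right] - \prod_{i=1}^{m}\left[1 - \prod_{j=1}^{n}\{1 - p_{ij}F_{ij}(t)\}\right]$$

and

$$R_{SP}(t) = \prod_{j=1}^{n}\left[1 - \prod_{i=1}^{m} p_{ij}F_{ij}(t)\right] - \prod_{j=1}^{n}\left[1 - \prod_{i=1}^{m}\{1 - (1 - p_{ij})F_{ij}(t)\}\right],$$

where $F_{ij}(t)$ and $p_{ij}$ belong to the component in position (i, j), with i = 1, ..., m counting components in parallel and j = 1, ..., n counting components in series.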


MAXIMIZING EXPECTED LIFE 


We have obtained the reliability of parallel and 
series circuits based on our probabilistic model. 
It is an easy matter to optimize the reliability 
function for a given value of t. However, in many 
cases the desired operating life is indeterminate. 
Perhaps a better criterion in this circumstance 
is to optimize the expected time to failure over 
m or n. If our failure distribution is exponen- 
tial, then it is an easy matter to calculate this 
quantity. 

If $G_m(t) = 1 - R_m(t)$, then $\int_0^\infty t\, dG_m(t)$ is the expected time to failure for a parallel circuit. Similarly, let $G_n(t) = 1 - R_n(t)$ for a series circuit. If $F(t) = 1 - \exp(-\lambda t)$, then it can be shown that

$$\int_0^\infty t\, dG_m(t) = \frac{1}{\lambda} \sum_{k=1}^{m} \frac{p^{m-k}}{k} \qquad (3)$$

for a parallel circuit and

$$\int_0^\infty t\, dG_n(t) = \frac{1}{\lambda} \sum_{k=1}^{n} \frac{(1 - p)^{n-k}}{k} \qquad (4)$$

for a series circuit. This pleasing result leads to the observation that the optimal m or n can be determined solely on the basis of a knowledge of p. Let $p_m$ be the solution in p to

$$\sum_{k=1}^{m+1} \frac{p^{m+1-k}}{k} - \sum_{k=1}^{m} \frac{p^{m-k}}{k} = 0. \qquad (5)$$

The solutions $p_m$ determine critical intervals for p. Table I has been constructed for values of m and n up to 10. Of course, one should use a series circuit for values of p less than 0.5 and a parallel circuit for values of p greater than 0.5.
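Equations (3) and (4) can be checked against a direct numerical integration of the expected life, using the fact that for a nonnegative lifetime the mean equals the integral of R(t) over t ≥ 0. The sketch assumes λ = 1 by default and a plain trapezoid rule on a truncated range; the cutoff and step count are arbitrary choices, and the names are hypothetical.

```python
import math


def expected_life_parallel(m: int, p: float, lam: float = 1.0) -> float:
    """Eq. (3): (1/lam) * sum_{k=1}^{m} p**(m-k) / k."""
    return sum(p ** (m - k) / k for k in range(1, m + 1)) / lam


def expected_life_series(n: int, p: float, lam: float = 1.0) -> float:
    """Eq. (4): eq. (3) with 1 - p in place of p."""
    return expected_life_parallel(n, 1.0 - p, lam)


def numeric_expected_life(m: int, p: float, lam: float = 1.0,
                          t_max: float = 40.0, steps: int = 100_000) -> float:
    """Trapezoid-rule integral of R_m(t) = x**m - y**m from 0 to t_max.
    t_max should be many multiples of 1/lam so the truncated tail is negligible."""
    def reliability(t: float) -> float:
        F = 1.0 - math.exp(-lam * t)
        return (1.0 - F + p * F) ** m - (p * F) ** m

    h = t_max / steps
    inner = sum(reliability(i * h) for i in range(1, steps))
    return h * (0.5 * (reliability(0.0) + reliability(t_max)) + inner)


print(expected_life_parallel(3, 0.8), numeric_expected_life(3, 0.8))
```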

As an example of how an engineer might use 




this table, suppose that he desired to parallel a 
number of switches in order to increase the over- 
all reliability. Furthermore, suppose that of 
those switches that had failed in the past, 95 per 
cent had been open-circuit failures. Assuming 
that failures occur by chance rather than through 
wearout, we can examine Table I to determine the 
optimal number of switches. From the table it 
can be seen that m = 9. 
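The switch example can be reproduced by maximizing the sum in (3) directly over m. A minimal sketch; the helper name and the search limit of 50 are arbitrary assumptions.

```python
def optimal_count(p: float, limit: int = 50) -> int:
    """Number of parallel components maximizing expected life, eq. (3)."""
    def scaled_life(m: int) -> float:          # lambda times the expected life
        return sum(p ** (m - k) / k for k in range(1, m + 1))
    return max(range(1, limit + 1), key=scaled_life)


print(optimal_count(0.95))      # 9 switches in parallel, 95% open failures
print(optimal_count(1 - 0.2))   # 3 components in series when p = 0.2, eq. (4)
```

The series case uses the same function with 1 - p, following the symmetry between (3) and (4).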


TABLE I

PARALLEL AND SERIES CIRCUIT PARAMETERS WHICH MAXIMIZE EXPECTED LIFE

Critical Intervals for p | Optimal m for a Parallel Circuit | Optimal n for a Series Circuit
p < 0.040          | 1            | more than 10
0.040 < p < 0.046  | 1            | 10
0.046 < p < 0.054  | 1            | 9
0.054 < p < 0.063  | 1            | 8
0.063 < p < 0.077  | 1            | 7
0.077 < p < 0.096  | 1            | 6
0.096 < p < 0.125  | 1            | 5
0.125 < p < 0.175  | 1            | 4
0.175 < p < 0.272  | 1            | 3
0.272 < p < 0.5    | 1            | 2
p = 0.5            | 1            | 1
0.5 < p < 0.728    | 2            | 1
0.728 < p < 0.825  | 3            | 1
0.825 < p < 0.875  | 4            | 1
0.875 < p < 0.904  | 5            | 1
0.904 < p < 0.923  | 6            | 1
0.923 < p < 0.937  | 7            | 1
0.937 < p < 0.946  | 8            | 1
0.946 < p < 0.954  | 9            | 1
0.954 < p < 0.960  | 10           | 1
p > 0.960          | more than 10 | 1


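We now wish to prove (3). Letting $F(t) = 1 - \exp(-\lambda t)$, we obtain

$$R_m(t) = \sum_{k=0}^{m-1} \binom{m}{k} \exp[-\lambda(m - k)t]\, p^k [1 - \exp(-\lambda t)]^k.$$

Expanding and integrating, we obtain

$$\int_0^\infty t\, dG_m(t) = \frac{m}{\lambda} \sum_{j=0}^{m-1} \binom{m-1}{j} \frac{p^{m-1-j}(1 - p)^{j+1} + (-1)^j p^m}{(j + 1)^2}.$$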




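Note that if

$$\int_0^\infty t\, dG_m(t) = \frac{1}{\lambda} \sum_{k=1}^{m} \frac{p^{m-k}}{k},$$

then

$$\int_0^\infty t\, dG_{m+1}(t) = \frac{p}{\lambda} \sum_{k=1}^{m} \frac{p^{m-k}}{k} + \frac{1/\lambda}{m + 1}.$$

Let

$$R_m = m \sum_{j=0}^{m-1} \binom{m-1}{j} \frac{p^{m-1-j}(1 - p)^{j+1} + (-1)^j p^m}{(j + 1)^2}.$$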


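We assert that $R_{m+1} = pR_m + 1/(m + 1)$. If we can show this, the result will follow by mathematical induction. It is a straightforward calculation to see that

$$R_{m+1} - pR_m = \sum_{j=0}^{m} \binom{m}{j} \frac{p^{m-j}(1 - p)^{j+1} + (-1)^j p^{m+1}}{j + 1}.$$

Let k = j + 1. Then we wish to show that

$$\sum_{k=1}^{m+1} \binom{m}{k-1} \frac{p^{m+1-k}(1 - p)^k + (-1)^{k-1} p^{m+1}}{k} = \frac{1}{m + 1}.$$

Note that $\binom{m+1}{k} = \frac{m + 1}{k} \binom{m}{k-1}$. Hence we need only show

$$\sum_{k=1}^{m+1} \binom{m+1}{k} \left[p^{m+1-k}(1 - p)^k + (-1)^{k-1} p^{m+1}\right] = 1.$$

This can be verified directly: by the binomial theorem, the first part of the sum equals $1 - p^{m+1}$ and the second part equals $p^{m+1}$.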
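The identity that closes the induction step, and the claim that R_m reduces to the sum in (3), can both be spot-checked in exact rational arithmetic. This is a sanity check for a few values of m and p, not a proof; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def identity_sum(m: int, p: Fraction) -> Fraction:
    """sum_{k=1}^{m+1} C(m+1,k)[p^(m+1-k)(1-p)^k + (-1)^(k-1) p^(m+1)]."""
    M = m + 1
    return sum(comb(M, k) * (p ** (M - k) * (1 - p) ** k + (-1) ** (k - 1) * p ** M)
               for k in range(1, M + 1))


def r_m(m: int, p: Fraction) -> Fraction:
    """R_m as defined from the expanded integral (lambda = 1)."""
    return m * sum(comb(m - 1, j)
                   * (p ** (m - 1 - j) * (1 - p) ** (j + 1) + (-1) ** j * p ** m)
                   / (j + 1) ** 2
                   for j in range(m))


def s_m(m: int, p: Fraction) -> Fraction:
    """The closed form of eq. (3) with lambda = 1."""
    return sum(p ** (m - k) / Fraction(k) for k in range(1, m + 1))


for p in (Fraction(1, 3), Fraction(1, 2), Fraction(19, 20)):
    for m in range(1, 12):
        assert identity_sum(m, p) == 1
        assert r_m(m, p) == s_m(m, p)
print("identity and R_m = S_m hold for the sampled m and p")
```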
We now wish to justify the construction of our table. Let

$$S_m(p) = \sum_{k=1}^{m} \frac{p^{m-k}}{k}$$

and $D_m(p) = S_{m+1}(p) - S_m(p)$.

Recall that $p_m$ is the solution in p to $D_m(p) = 0$ [see (5)]. Solutions always exist, by continuity, since $D_m(0) = -1/[m(m + 1)]$ and $D_m(1) = 1/(m + 1)$. It is an easy matter to see that $p_1 = 0.5$ and $p_2 = 0.728$. Also, $D_1(p) = p - 1/2$ is clearly increasing in p for p > 0.5. Hence, m = 2 is to be preferred to m = 1 in the interval 0.5 < p < 1. It will be sufficient in general to show that $D_m(p)$ is an increasing function of p for $p > p_m$. Suppose this assertion is true for m = k. Then


960 SCHAFER: INTERVAL ESTIMATION OF PRODUCT RELIABILITY 717 


$$D_{k+1}(p) = pD_k(p) - \frac{1}{(k + 1)(k + 2)}$$

implies

$$\frac{d}{dp} D_{k+1}(p) = D_k(p) + p \frac{d}{dp} D_k(p) > 0 \quad \text{for } p > p_k,$$

since $D_k(p_k) = 0$ and $D_k(p)$ is increasing for $p > p_k$ by assumption. Since $D_{k+1}(p_k) < 0$ (so that $p_{k+1} > p_k$), surely $D_{k+1}(p)$ is increasing for $p > p_{k+1}$. Appealing to the axiom of mathematical induction, we conclude that our assertion is true for all positive integers.
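The critical values p_m in Table I can be recomputed by bisection on D_m(p), using D_1(p) = p - 1/2 and the recurrence D_{k+1}(p) = pD_k(p) - 1/[(k + 1)(k + 2)]. Bisection is an assumed choice of root finder (the text does not say how the table was computed); it relies on the sign change D_m(0) < 0 < D_m(1) shown above.

```python
def D(m: int, p: float) -> float:
    """D_m(p) = S_{m+1}(p) - S_m(p), built up from D_1 by the recurrence."""
    d = p - 0.5                                  # D_1(p)
    for k in range(1, m):
        d = p * d - 1.0 / ((k + 1) * (k + 2))    # D_{k+1} from D_k
    return d


def critical_p(m: int, tol: float = 1e-12) -> float:
    """Root of D_m on [0, 1]; D_m(0) < 0 < D_m(1) brackets it."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if D(m, mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for m in range(1, 11):
    # Above p_m, m + 1 parallel components beat m; the series
    # thresholds are 1 - p_m by the symmetry of (3) and (4).
    print(m, round(critical_p(m), 4), round(1 - critical_p(m), 4))
```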

Eq. (4) and the values in the table for series 
circuits hold by symmetry. 


References 


[1] R. Barlow and L. C. Hunter, “System efficiency and reliability,” 1959 IRE NATIONAL CONVENTION RECORD, pt. 6, pp. 104-110; March, 1959.

[2] J. P. Gilmore and V. H. Levi, “Reliability through redundancy,” Symposium on Military Electronic Reliability and Maintainability, Rome Air Dev. Center, Rome, N. Y., vol. 3, ASTIA Doc. No. AD-148953; November, 1958.

[3] J. P. Lipp, “Topology of switching elements vs reliability,” IRE TRANS. ON RELIABILITY AND QUALITY CONTROL, no. PGRQC-10, pp. 21-34; June, 1957.

[4] F. Moskowitz, “An analysis of redundancy networks,” Rome Air Dev. Center, Rome, N. Y., Tech. Note, ASTIA Doc. No. AD-148588; February, 1958.

[5] H. W. Price, “Reliability of parallel electronic components,” Symposium on Military Electronic Reliability and Maintainability, Rome Air Dev. Center, Rome, N. Y., vol. 3, ASTIA Doc. No. AD-148953; November, 1958.


INTERVAL ESTIMATION OF PRODUCT RELIABILITY BY 
USE OF THE NONCENTRAL t DISTRIBUTION 


R. E. SCHAFER*


INTRODUCTION 


The use of certain exponential distributions in 
describing chance failure functions is well known 
in reliability literature. Methods are available 
for obtaining “best” estimates of the parameters
of these distributions and confidence intervals 
for the true values. 

In many cases, notably in wear-out failure and 
stress-to-failure distributions, the frequency 
function is often well represented by the normal 
distribution function. 

Certain statistics obtainable from a normal 
distribution are used quite extensively in esti- 
mating product reliability, yet not a large amount 
has been written about interval estimates for 
these statistics. 

It is the purpose of this paper to consider the statistic

$$t = \frac{\bar{x} - U}{s},$$

where

U is a constant,

x̄ is a sample arithmetic mean and an estimate of the population arithmetic mean,

s is the estimate of the population standard deviation (σ) obtained from a sample (the square root of the unbiased estimate of the variance);

*Semiconductor Div., Hughes Products, Culver City, Calif.


and to develop an approximate method for obtaining interval estimates of t. Since x̄ and s in the statistic are only estimates and are subject to sampling error, the statistic t will also be subject to sampling errors and clearly will have a sampling distribution. This sampling distribution is referred to as the noncentral t distribution (more precisely, $\sqrt{n}\,t$ follows the noncentral t distribution with n - 1 degrees of freedom).

Throughout this paper it is assumed that the 
characteristic in question is normally distributed. 
Methods are available to check this assumption 
which range from “by eye” tests to statistical
tests with calculable risks of wrong decision. 
The methods themselves need not be of concern 
here. 


RELIABILITY CALCULATION 


Before considering the sampling distribution 
of t, we will show, by example, how product re- 
liability may be calculated from this statistic. 

Consider an electronic component subjected 




to extremely heavy constant load conditions until failure. A histogram of the length of life of a large number of these components would appear as shown in Fig. 1, where it is assumed that the distribution shown is well approximated by the normal distribution. Further, it is assumed that the arithmetic mean life is known to be μ and the standard deviation is known to be σ. Thus, there is the normal distribution function shown in Fig. 2.


Fig. 1. Histogram: number of components vs. length of life.


Now if a minimum specification (U) has been placed on the length of life of the component, the exact proportion of the components that will fail to meet this specification in the long run can be calculated from

$$z = \frac{\mu - U}{\sigma},$$

where z is a standardized normal deviate; from tables of the standardized normal distribution we find

$$\int_{-\infty}^{-z} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx = P_z.$$

$P_z$, then, is the proportion of the components that will fail to meet the specification of a length of life of U hours or better. Then $(1 - P_z)$ is the proportion that will meet or better the above requirements. In fact, where the variable x denotes length of life,

$$\Pr(x > U) = 1 - P_z.$$

This says merely that the probability that an individual component will have a length of life greater than the minimum specification U is $1 - P_z$. In keeping with the definition of reliability in common use today,

$$R = 1 - P_z,$$

i.e., the probability of a component performing successfully under the conditions we have set is $1 - P_z$, or R.
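With μ and σ known, R is a single normal-table lookup. The sketch uses Python's `statistics.NormalDist` in place of printed tables; the life figures (μ = 1000 h, σ = 100 h, U = 800 h) are made-up illustration values, not data from the paper.

```python
from statistics import NormalDist


def reliability_known(mu: float, sigma: float, U: float) -> float:
    """R = 1 - P_z, where P_z is the standard normal probability below -z
    and z = (mu - U) / sigma."""
    z = (mu - U) / sigma
    p_z = NormalDist().cdf(-z)     # proportion failing to reach U hours
    return 1.0 - p_z


R = reliability_known(mu=1000.0, sigma=100.0, U=800.0)   # hypothetical values
print(f"R = {R:.5f}")
```

Here z = 2, so R is about 0.977; the same number follows from `NormalDist(mu, sigma).cdf(U)` as the complement of the fraction below U.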




Actually, the above example is a trivial case; it was used merely to illustrate the procedure in finding R. The reasons for this are clear. Knowledge of μ and σ (population parameters) implies complete knowledge and enumeration of the population. Thus

1) the exact number of components above U can be counted, and no assumptions about the form of the distribution need be made;

2) rarely, if ever, are μ and σ known.

The more usual case, then, is the situation in which we have

x̄  estimate of μ
s  estimate of σ

and the statistic becomes

$$t = \frac{\bar{x} - U}{s}.$$

The following section is devoted to finding approximate confidence intervals for the statistic t.


INTERVAL ESTIMATE FOR t 


The mean value of t (which we will call t') is, to a first approximation,

$$t' = \frac{\mu - U}{\sigma}.$$

Since t is a nonlinear function of x̄ and s, U being a constant, the variance of t ($\sigma_t^2$) presents






somewhat of a problem. Variances of nonlinear 
functions are difficult to approximate and some- 
times the approximations are not too good, de- 
pending, of course, on the function. 

In this case, however, where t = f(x,s), we 
can expand the function in a Taylor’s series and 
obtain a good approximation to $\sigma_t^2$, provided that
we fulfill certain general conditions in addition to 
the usual conditions necessary for the existence 
of a Taylor’s series. These conditions are as 
follows. 


1) Independence of x and s. 

2) If we use only the linear terms of the Taylor series, the function should be approximately linear in the region of interest for the independent (x̄, s) variables. The region of interest for the independent variables might be, for example, within ±3 standard deviations of their mean value. In short, the approximation is best when the deviations from the mean value are small.


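In general, it has been shown that for $w = f(x_1, x_2, \ldots, x_p)$, the Taylor's series expansion, evaluated at $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)$, yields

$$\sigma_w^2 \approx \left(\frac{\partial w}{\partial x_1}\right)^2 \sigma_{x_1}^2 + \left(\frac{\partial w}{\partial x_2}\right)^2 \sigma_{x_2}^2 + \cdots + \left(\frac{\partial w}{\partial x_p}\right)^2 \sigma_{x_p}^2 \qquad (1)$$

under conditions 1 and 2 listed above.¹ Returning then to our statistic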


mens x 


and from (1) 


where 0% is the variance of a standard deviation 
@t Net S9-0 Bt vo -Viaew 
ox si. as S70)" 
but ax 
s 
-ts=-U+x 
cL el 
OSes 
Thus, 
Sihio-a at ieee 
of-(2) &+(5) 9%: 


¹C. A. Bennett and N. L. Franklin, ‘‘Statistical Analysis in Chemistry and the Chemical Industry,’’ John Wiley and Sons, Inc., New York, N. Y.; 1954.


SCHAFER: INTERVAL ESTIMATION OF PRODUCT RELIABILITY 79


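but, for a sample of n from a normal population, $\sigma_{\bar{x}}^2 = \sigma^2/n$ and, approximately, $\sigma_s^2 = \sigma^2/2n$. Replacing $\sigma$ by its estimate s,

$$\sigma_t^2 = \frac{1}{n} + \frac{t^2}{2n} = \frac{2 + t^2}{2n}. \qquad (2)$$

Now it can be shown that, as n gets very large, the variable t is approximately normally distributed with mean

$$t' = \frac{U - \mu}{\sigma}$$

and variance (2), where t is replaced with t', the true value of the mean. In fact, for most purposes, when n > 30, the approximation is satisfactory.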
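The step from (1) to (2) can be checked numerically. The sketch below is an illustration, not the author's procedure: it estimates the two partial derivatives of $t = (U - \bar{x})/s$ by central differences, plugs in $\sigma_{\bar{x}}^2 = s^2/n$ and $\sigma_s^2 \approx s^2/2n$, and compares the result with the closed form (2). The function names and the relative step size are illustrative choices, and the numbers are taken from the worked example later in the paper.

```python
def propagate_variance(f, point, variances, rel_step=1e-6):
    """First-order (Taylor's series) variance of f at `point`, as in eq. (1).

    f         -- function taking a tuple of independent variables
    point     -- values at which the partial derivatives are evaluated
    variances -- variance of each variable (variables assumed independent)
    """
    total = 0.0
    for i, var_i in enumerate(variances):
        h = rel_step * max(1.0, abs(point[i]))
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        # central-difference estimate of the partial derivative df/dx_i
        dfdx = (f(tuple(up)) - f(tuple(down))) / (2 * h)
        total += dfdx ** 2 * var_i
    return total


# Illustrative values, taken from the worked example later in the paper
U, xbar, s, n = 3300.0, 4000.0, 360.0, 100


def t_stat(args):
    """t = (U - xbar) / s with U held constant."""
    m, sd = args
    return (U - m) / sd


var_xbar = s ** 2 / n       # variance of the sample mean, s in place of sigma
var_s = s ** 2 / (2 * n)    # large-sample variance of s for normal data

t = t_stat((xbar, s))
numeric = propagate_variance(t_stat, (xbar, s), (var_xbar, var_s))
closed_form = (2 + t ** 2) / (2 * n)   # eq. (2)
print(f"numeric {numeric:.6f}  closed form {closed_form:.6f}")
```

Because t is linear in $\bar{x}$ and smooth in s, the finite-difference value and (2) should agree to many decimal places.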

Upon inspection of (2), it may be noticed that the variance of t depends upon the magnitude of t. This seems intuitively correct, because what is really being estimated is the proportion below some value U, and when this is done with the binomial distribution, it is found that

$$\sigma_p^2 = \frac{P(1 - P)}{n}.$$

The variance of a proportion depends upon the magnitude of the proportion. Thus, intuitively, the variance of t should involve the magnitude of t in some way, and it does.
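One way to see that the variance of t really grows with |t'| is a small simulation. The sketch below is an illustration under stated assumptions: normal samples with $\mu = 0$ and $\sigma = 1$ (so U equals t'), n = 100, and an arbitrary trial count and seed. For each t' it compares the sample variance of simulated t values with (2).

```python
import math
import random


def simulated_var_t(t_true, n, trials=4000, seed=1):
    """Monte Carlo variance of t = (U - xbar)/s for normal samples of size n.

    With mu = 0 and sigma = 1 (no loss of generality), U equals t'.
    """
    rng = random.Random(seed)
    ts = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        ts.append((t_true - xbar) / s)
    mean_t = sum(ts) / trials
    return sum((v - mean_t) ** 2 for v in ts) / (trials - 1)


n = 100
for t_true in (0.0, -1.944, -3.0):
    approx = (2 + t_true ** 2) / (2 * n)   # eq. (2) evaluated at t'
    sim = simulated_var_t(t_true, n)
    print(f"t' = {t_true:6.3f}   simulated {sim:.4f}   eq. (2) {approx:.4f}")
```

The simulated variances should track (2) to within a few per cent, with the small residual coming from Monte Carlo noise and from the higher-order terms that (1) neglects.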

The matter of confidence intervals can now be considered. Of course, the usual type of interval, $t \pm K\sigma_t$, where K is the multiple of the standard deviation that yields a certain confidence coefficient, cannot be placed on t. This would be begging the question, since $\sigma_t$ depends in part on the magnitude of our sample estimate t.

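Rather, a probability statement of the following nature is made:

$$\Pr\left(t' - K_{\alpha/2}\sqrt{\frac{2 + t'^2}{2n}} < t < t' + K_{\alpha/2}\sqrt{\frac{2 + t'^2}{2n}}\right) = 1 - \alpha. \qquad (3)$$

Subtracting t' from each member of the inequality and dividing through by $\sqrt{(2 + t'^2)/2n}$ gives

$$\Pr\left(-K_{\alpha/2} < \frac{t - t'}{\sqrt{(2 + t'^2)/2n}} < K_{\alpha/2}\right) = 1 - \alpha.$$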


To find the end points of this interval, we can set 


80 IRE TRANSACTIONS ON RELIABILITY AND CONTROL APRIL 


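$$\frac{t - t'}{\sqrt{(2 + t'^2)/2n}} = \pm K_{\alpha/2}.$$

Omitting the subscript of K for convenience,

$$\frac{(t - t')^2}{(2 + t'^2)/2n} = K^2,$$

$$2nt^2 - 4ntt' + 2nt'^2 = 2K^2 + K^2 t'^2,$$

$$(2n - K^2)\,t'^2 - (4nt)\,t' + (2nt^2 - 2K^2) = 0.$$

This is seen to be a quadratic equation in t' with

$$a = 2n - K^2, \qquad b = -4nt, \qquad c = 2(nt^2 - K^2).$$

Solving, we get

$$t' = \frac{4nt \pm \sqrt{-8K^4 + 16nK^2 + 8nK^2t^2}}{4n - 2K^2}.$$

Simplification and rearrangement gives

$$t' = \frac{2nt \pm K\sqrt{2n(2 + t^2) - 2K^2}}{2n - K^2}. \qquad (4)$$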


Within the limits of the assumptions which have been made previously, these are the exact confidence limits for t'; however, in order to simplify the rather cumbersome formula (4), n is allowed to become large. Under the assumption that n is very large, the denominator becomes essentially equal to 2n.

The quantity under the radical in the numerator becomes approximately $2n(2 + t^2)$. Then

$$t' \approx \frac{2nt}{2n} \pm \frac{K\sqrt{2n(2 + t^2)}}{2n} = t \pm K\sqrt{\frac{2 + t^2}{2n}}.$$

Replacing the subscript, we have

$$t' = t \pm K_{\alpha/2}\sqrt{\frac{2 + t^2}{2n}}.$$
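The exact roots of the quadratic, (4), and the large-n form can be compared directly. The sketch below uses illustrative function names, K = 1.96, and t = -1.944 (the value in the worked example that follows). It shows how the gap between the two sets of limits shrinks as n grows.

```python
import math


def limits_exact(t, n, K):
    """Both roots in t' of the quadratic, eq. (4)."""
    root = math.sqrt(2 * n * (2 + t ** 2) - 2 * K ** 2)
    denom = 2 * n - K ** 2          # positive for any practical n when K = 1.96
    return ((2 * n * t - K * root) / denom, (2 * n * t + K * root) / denom)


def limits_large_n(t, n, K):
    """Large-sample limits t +/- K * sqrt((2 + t^2) / (2n))."""
    half = K * math.sqrt((2 + t ** 2) / (2 * n))
    return (t - half, t + half)


t, K = -1.944, 1.96
for n in (30, 100, 1000):
    ex_lo, ex_hi = limits_exact(t, n, K)
    ap_lo, ap_hi = limits_large_n(t, n, K)
    print(f"n = {n:5d}  exact ({ex_lo:.3f}, {ex_hi:.3f})  "
          f"large-n ({ap_lo:.3f}, {ap_hi:.3f})")
```

For moderate n the exact limits lie somewhat farther from t than the large-n ones, because the denominator $2n - K^2$ is smaller than 2n.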


In short, the usual method of establishing confidence limits is satisfactory as n gets large.

An excellent discussion of applications of the noncentral t distribution has been given,² as has a complete exposition of the principles involved in using that distribution in variables sampling plans, together with tables of tolerance limits for the normal distribution.³

²N. L. Johnson and B. L. Welch, ‘‘Application of the noncentral t distribution,’’ Biometrika, vol. 31, pp. 362-389; 1940.


As an illustration of reliability calculation, consider the following practical example.

In a stress-to-failure test of 100 electrical components, the sample mean length of life and standard deviation were found to be

$$\bar{x} = 4000 \text{ hours}, \qquad s = 360 \text{ hours}, \qquad n = 100.$$


Consider a minimum specification set at, say, 3300 hours. Then our estimate of t' is

$$t = \frac{3300 - 4000}{360} = -1.944.$$

The normal tables yield a fraction below 3300 of P = 0.026. Thus, we can say that the probability of a given component performing successfully is

$$R = 1 - P = 0.974.$$


However, this is probably not the exact value for R, and an interval estimate is in order. Using (3) in its large-sample form, we have, for α = .05,

$$\Pr\left(t - K_{\alpha/2}\sqrt{\frac{2 + t^2}{2n}} < t' < t + K_{\alpha/2}\sqrt{\frac{2 + t^2}{2n}}\right) = 1 - \alpha;$$

since $K_{\alpha/2} = 1.96$, we have

$$t' = -1.944 \pm 1.96\sqrt{\frac{2 + (1.944)^2}{200}} = -1.944 \pm .333.$$

Thus the probability is 0.95 that the interval from -2.277 to -1.611 contains t'.
Proceeding to the normal tables, we find that the fraction of failures lies somewhere between

$$P_{\text{lower}} = 0.0114 \quad \text{and} \quad P_{\text{upper}} = 0.0536,$$

so that

$$R_{\text{lower}} = 0.9464 \quad \text{and} \quad R_{\text{upper}} = 0.9886.$$

We are 95 per cent certain that the true reliability lies somewhere between 0.9464 and 0.9886. Mathematically,

$$\Pr(0.946 < R < 0.989) = 0.95.$$
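The arithmetic of this example can be reproduced with Python's standard-library `statistics.NormalDist` in place of printed normal tables. This is a sketch with illustrative variable names. `inv_cdf(0.975)` returns about 1.95996 rather than exactly 1.96, which does not change the rounded results.

```python
from statistics import NormalDist

Z = NormalDist()                     # standard normal distribution

U, xbar, s, n = 3300.0, 4000.0, 360.0, 100
alpha = 0.05

t = (U - xbar) / s                   # point estimate of t'
P = Z.cdf(t)                         # estimated fraction below the specification
R = 1 - P

K = Z.inv_cdf(1 - alpha / 2)         # about 1.96
half = K * ((2 + t ** 2) / (2 * n)) ** 0.5
t_lo, t_hi = t - half, t + half      # large-sample limits for t'

# A larger t' means more failures, so the upper t' limit gives the lower R limit
R_lo, R_hi = 1 - Z.cdf(t_hi), 1 - Z.cdf(t_lo)
print(f"t = {t:.3f}, R = {R:.3f}")
print(f"t' in ({t_lo:.3f}, {t_hi:.3f});  R in ({R_lo:.4f}, {R_hi:.4f})")
```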


³C. Eisenhart, M. W. Hastay, W. A. Wallis, ‘‘Techniques of Statistical Analysis,’’ McGraw-Hill Book Co., Inc., New York, N. Y.; 1947.
It should be pointed out that the accuracy of this method in no way justifies carrying as many decimals as has been done. In fact, for very large absolute values of t (high reliability), the approximation is relatively poor, but the absolute error is small. Even so, the estimate based on t has a smaller variance than the estimate obtained from the binomial distribution, that is, from simply counting the failures.
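The variance comparison can be made concrete with a sketch that is not in the paper. It maps the variance of t into a variance for the estimated failure fraction $\hat{P} = \Phi(t)$ by the same first-order expansion used in (1), giving $\phi(t')^2 (2 + t'^2)/2n$, and compares that with the binomial $P(1 - P)/n$. Normality of the underlying variable is assumed throughout, and the function names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()


def var_p_from_t(t_true, n):
    """Approximate variance of P_hat = Phi(t), from eq. (2) and a first-order expansion."""
    return Z.pdf(t_true) ** 2 * (2 + t_true ** 2) / (2 * n)


def var_p_binomial(t_true, n):
    """Variance of the observed fraction failing, P(1 - P)/n."""
    P = Z.cdf(t_true)
    return P * (1 - P) / n


n = 100
for t_true in (-1.0, -1.944, -3.0):
    ratio = var_p_from_t(t_true, n) / var_p_binomial(t_true, n)
    print(f"t' = {t_true:6.3f}   variance ratio (t-based / binomial) = {ratio:.2f}")
```

The ratio does not depend on n, and it stays below 1 at each of these values of t', falling as |t'| grows. This is the advantage the text describes, and it holds only so long as the normality assumption holds.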


SELECTION OF SAMPLE SIZE 


With the aid of (2), selection of sample size is relatively simple, although not exact. First, an estimate of the reliability of the product must be made. Then, for a certain confidence coefficient 1 - α, the amount of variability in t that can be tolerated must be specified.

For example, an engineer estimates the reliability of a component to be 0.99, but seeks to find the ‘‘exact’’ value:

$$R = 0.99.$$

The normal table gives t = -2.33 for R = 0.99.

Further, the engineer specifies that he wishes to be 95 per cent certain that the estimate t of t' misses t' by no more than ±10 per cent of t'. In effect, he has now specified the width of the $1.96\,\sigma_t$ limits of the sampling distribution of t. Thus,

$$1.96\,\sigma_t = .10\,|t'|,$$

$$\sigma_t \approx .05\,|t'|.$$


McGUIGAN: IS ANYTHING NEW IN RELIABILITY 81 


But |t'| is estimated to be 2.33; thus

$$\sigma_t = .05(2.33) = .116.$$

Substituting in (2) and solving for n,

$$n = \frac{2 + t'^2}{2\sigma_t^2} = \frac{2 + (2.33)^2}{2(.116)^2} = \frac{7.43}{.0269} \approx 276.$$

Of course, it is obvious that if we knew t' we would not have to submit 276 units to test, but at least an estimate of t' permits us to predict the sample size to a certain extent and, thus, to schedule work loads and budget costs.
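The sample-size step is (2) solved for n. In the sketch below, `sample_size` is a hypothetical helper name. It first reproduces the text's figure from its rounded inputs, then repeats the calculation without rounding 0.10/1.96 to .05 or 2.326 to 2.33, to show how sensitive n is to those choices.

```python
from statistics import NormalDist


def sample_size(t_guess, sigma_t):
    """Solve eq. (2), sigma_t**2 = (2 + t**2) / (2n), for n."""
    return (2 + t_guess ** 2) / (2 * sigma_t ** 2)


# The text's inputs: |t'| = 2.33 for R = 0.99, and sigma_t = .05|t'| = .116
n_text = sample_size(2.33, 0.116)

# Same specification without intermediate rounding:
# t' from the inverse normal CDF and sigma_t = 0.10|t'| / 1.96
z = NormalDist()
t_guess = z.inv_cdf(1 - 0.99)
sigma_t = 0.10 * abs(t_guess) / z.inv_cdf(0.975)
n_unrounded = sample_size(t_guess, sigma_t)

print(f"{n_text:.1f}  {n_unrounded:.1f}")
```

The two results, roughly 276 and 263, differ only because of intermediate rounding. In practice the chosen n would be rounded up.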


CONCLUSION 


The preceding methods yield interval estimates for the statistic $t = (U - \bar{x})/s$ obtained from a normal distribution and, thus, are applicable to any variable which is normally distributed. The accuracy of the method is subject to fulfillment of the conditions mentioned in the body of this paper. Although only a single-tailed problem was cited, the extension to two tails is simple, as are tests of hypotheses concerning t' or of the significance of the difference between two t's.


IS ANYTHING NEW IN RELIABILITY?* 


W. D. McGuigan†


I accepted the invitation of your chairman to discuss the historical aspects of electronic reliability with mixed emotions. It is quite an honor to open your seminar. On the other hand, I am not quite sure how I came to be regarded as being an authority on the historical aspects of this subject. The last time I had this feeling was about ten years ago. I was trying on a new suit, looking into a three-sectioned mirror, when I saw for the first time that I had a bald spot.

*This paper was presented at the First Annual Bay Area Reliability Seminar, Menlo Park, Calif., February 19, 1959.

†Engineering Div., Stanford Research Institute, Menlo Park, Calif.

I should like to run over a little history with the 
purpose of showing that most of the things we are 
doing for reliability are things we have been doing 
for a long time, frequently under the same titles. 
From this, it should be clear that, in the way we have been fighting for reliability, we have been solving problems peculiar to the components and systems of the moment, and not really making any lasting contributions to science.

The oldest concern of the reliability hunters 
has been with the subject of environment. This 
subject is so old that it actually began about two 
years before the invention of the audion. Dr. Lee 
DeForest was experimenting with the conduction 
and control of electrical current between a Bunsen 
burner and a pair of ringstands. To shut out vari- 
ations due to air currents in the room he enclosed 
the experiment in a tube. Later, seeking complete 
isolation, he substituted an Edison filament for the 
burner and sealed the experiment into a light bulb. 
The birth of electronics in an evacuated glass en- 
velope thus made reliability a congenital problem. 

The fight for better components, materials and 
techniques is an old one too. The work on vacuum 
tubes at the General Electric Company before and 
during World War I should stand as a monument to 
those who would embark on component develop- 
ment programs. One might wonder at the state of 
their art, however, because while nominally trying 
to improve vacuum tubes, they invented an ex- 
traordinary number of gas filled devices. After 
nearly 20 years, their leading genius, Dr. Irving 
Langmuir, even won a Nobel prize for his work on 
electrical discharges in gases. 

Human engineering entered the picture about 
1921. Until Harold Elliott made the first gang- 
tuned Magnavox, it wasn’t just anybody who could 
tune into a signal. This great contribution may 
have triggered an early, technological counterpart 
to Parkinson’s Law. Instead of simplifying the 
radioman’s task, it simply made it possible for 
him to handle a greater complexity of equipment. 

Perhaps another great step in the reduction of 
human factors came with the elimination of arc 
transmitters. These remarkable devices were 
quenched with pure grain spirits of alcohol which 
was supplied to most radio shacks in fifty gallon
drums. When this was no longer necessary, it is 
little wonder that engineers redoubled their efforts 
to seek unusual ways of generating and detecting 
electromagnetic signals. 

An early example of system design for reliabil- 
ity came about 1924. This was the invention of 
automatic volume control, I believe, by J. V. L. 
Hogan. Marcus Acheson once characterized this 
invention as an elephant in the jungle of reliability. 
Here, for the first time, electronic components 
could be made to compensate for each other. In 
one step, the short-term reliability increased per- 
haps a millionfold. 

We should acknowledge another class of contri- 
butions which has had much to do with our ability 
to maintain equipment. Volt-ohmmeters, oscilloscopes and tube testers, which evolved during the 1920’s and early 1930’s, are examples of essentials which gave us tremendous leverage on the reliability problem.

Fail-safe techniques began to appear in the 
thirties. While their results were modest com- 
pared to the results obtained recently by the AEC, 
the designers of the low-frequency four-course 
ranges and, later, the instrument landing systems, 
were quite aware of their obligations to keep radio 
beams from drifting off course. 

World War II deserves special treatment in a review of reliability. First, the complexity of systems expanded in less than four years by an average factor of perhaps ten. Second, a large number of people without previous exposure to electronics were pressed into its service. Third, and perhaps more important, the pressure to hurry and simultaneously to change the art led to a tendency to overlook details or at least to compromise them.

During this period, the reliability experts were 
not electronics engineers. Rather they were a 
perverted lot, equipped with shake tables, drop 
tests, salt spray chambers and the like, who, in 
fact, were unkindly disposed toward electronics. 
Aside from discouraging engineers, they had no 
permanent effect on the electronics industry be- 
cause most of the changes involved things like 
brackets or platings--things without a permanent 
role in electronics. 

One worthwhile set of postwar concepts dealt 
with maintenance minimization. Taylor’s work on 
marginal checking in the Whirlwind computer was 
perhaps the first recognition of the importance of 
drift rates and design margins. In 1950, Devey, 
then of ONR, made people aware of the cost of 
maintenance by setting the now famous mainte- 
nance-to-original cost ratio in the range of 10:1. 
This led to renewed interest in replaceable and 
throw-away packages. A prominent West Coast 
Research Institute even had a project to design a 
fault-finding system for some naval equipment. 

After World War II, the very large system ap- 
peared. For the first time it was impossible for 
one talented, well-trained, versatile, omniscient, 
energetic, personable, persuasive and healthy en- 
gineer to understand every aspect of these proj- 
ects. Computers, planes, and missile systems all 
got so complicated that the projects outlasted an 
average of two and a half chief engineers. It then 
became popular to suspect that the organization 
rather than the people involved might be causing 
the trouble.

One organizational example of reliability I like concerns one of the districts of CAA. The engineers in this group had been doing an outstanding job of maintaining their equipment when suddenly their records showed a fantastic surge in failures.
After some research, it was discovered that an 
enterprising purchasing agent had done them in by 
purchasing the entire stock of tubes and magne- 
trons from a war surplus mart. 

Reliability papers in the early 1950’s frequently fell into stereotypes. The first type was along the theme, ‘‘This is a tremendous problem,’’ usually presented by someone of stature in the military. Our commercial friends, meanwhile, started giving speeches along the lines, ‘‘Our quality control department reports to ‘God’,’’ or ‘‘We’re confused, but all our data is on IBM.’’

A very important contribution to reliability, 
not as well-known to most engineers as I think it 
should be, is the turn-about occurring in contract- 
ing procedures. The Electronic Industries Asso- 
ciation committee on Electronic Applications (Re- 
liability) and the DOD’s Advisory Group on the 
Reliability of Electronic Equipment, both under 
the leadership of Lewis M. Clement, undertook in 
1955 to educate some of the conservatives in gov- 
ernment on the significance of their contracting 
procedures. 

The general theme was that the military wasn’t 
likely, ever, under existing procedures, to get re- 
liable equipment. The principal arguments in 
favor of this position were: first, there was no 
correlation between lowest bids and best reliabil- 
ity; second, separation of responsibility for re- 
search, development, production and maintenance 
placed incentives for getting equipment out of the 
door rather than for continued operation. The 
present metamorphosis of this plan is going under 
the name ‘‘value engineering,’’ but I hope none of 
you will be discouraged by that. 

One shouldn’t pass the recent history of relia- 
bility without mention of the art of error detection 
and error correction developed by the computer 
segment of our industry. 




To date, the Professional Group on Reliability 
and Quality Control has published nearly 100 pa- 
pers. It is worth noting that the most popular sin- 
gle subject (26 papers) concerns the reliability of 
vacuum tubes. Yet, since January, 1957, there 
has been only one paper on this subject. Further disillusioning to the tube testers is ARINC’s recent finding that selective testing of modern tubes probably only downgrades the tubes actually used.

The point worth noting is that we have accumu- 
lated empirical rather than scientific information 
about reliability. If we look for things in our past 
accumulations that will help design engineers, 
there is relatively little. The factor of ten by 
which reliability has improved in the past 10 years 
is far less attributable to our papers on reliabil- 
ity than to the invention of transistors. 

If we are looking for lessons out of reliability 
history, let me suggest the following. 


1) Far more research is needed, particularly 
in instrumentation and statistics. 


2) We should ban new systems unless they are 
likely to improve performance and cut 
maintenance of an older system by some 
substantial margin. 


3) We should not be carried away by enthusi- 
asts for any large system. Large systems 
will be down for maintenance, even if they 
are so-called essential links in our defense 
system. 


4) Above all, don’t bother with historical data 
on electronic reliability. The components, 
the environments, and the purpose of mod- 
ern systems are different. Like meteorol- 
ogists, you should look at yesterday’s weath- 
er only so long as it helps with tomorrow’s 
prediction. 




CONTRACTUAL ASPECTS OF RELIABILITY* 


R. W. SMILEY†


The release in April of last year of the so- 
called Reliability Redbook, ‘‘A Proposed Reliabil- 
ity Monitoring Program for the Design, Develop- 
ment and Production of Guided Missile Weapons 
Systems,’’ has generated considerable interest 
within Industry about a subject which has long 
plagued military personnel associated with the 
procurement of complex weapons. The basic point 
of argument or discussion, depending upon your 
point of view, revolves about a single sentence in 
it, which I quote: ‘‘The reliability monitoring pro- 
gram proposed here is based on the premise that 
reliability is a parameter that can be quantitative- 
ly specified, estimated, assessed or measured at 
predesignated steps or monitoring points of a 
guided missile weapons system’s life cycle and 
that it can be controlled throughout the phases of 
design, development, production and major prod- 
uct improvement.’’ If one accepts this premise 
(its acceptance is almost ideology to statisticians 
and reliability engineers, but is in the same category as communist ideology to top-level corporation managers), then it follows that it should be
possible for the Government to specify by number 
in a contract a required level of reliability for 
both development and production. In this way the 
Government could contractually assure that the 
high reliability required of these important and 
complex weapon systems would be attained, and it 
follows that incentive payments for attainment (or 
penalty clauses for nonattainment) could be writ- 
ten into those contracts. Then all that would be 
necessary to implement these contractual require- 
ments is for the contracting agency and the con- 
tractor to agree upon a number of monitoring 
points in the program, establish tests to assess 
the attained level of reliability of either the design 
or the hardware, perform the test by either con- 
tractor’s or service personnel, insert the results 
of these tests into appropriate formulas, and ar- 
rive at a measure of performance upon which the 
original agreed upon numbers of dollars would 
then be paid. Such a program, I know, would be well received by the watch keepers of John Q. Taxpayer’s dollars, and it is being given serious consideration by top-level technical and program managers in the procurement agencies of the Department of Defense.

*Presented at First Annual Bay Area Reliability Seminar, Menlo Park, Calif., February 19, 1959.

†Lt. Comdr., U. S. Navy, INSORD, Sunnyvale, Calif.

My first thought when I originally received the 
Redbook was a quotation of Lord Kelvin’s which is 
the watchword of the Bureau of Standards and of 
our own Bureau of Ordnance measurement stand- 
ardization program. Eighty years ago, Lord Kel- 
vin said, ‘‘When you can measure what you are 
speaking about and express it in numbers, you 
know something about it; but when you cannot 
measure it, when you cannot express it in num- 
bers, your knowledge is of a meager and unsatis- 
factory kind; it may be the beginning of knowledge, 
but you have scarcely, in your thoughts, advanced 
to the state of science, whatever the matter may 
be.’’ In all fairness to reliability predictors and assessors, I must concede that in certain instances it is possible both to predict and to assess with reasonable statistical validity. Unfortunately, the degree of validity, or more commonly the confidence limits, is in direct ratio to the number of test specimens and in inverse ratio to the number of parts in each specimen. We are considering here the problem of contracting for reliability in highly complex weapons whose production rates are extremely low and whose unit cost is extremely high. Ballistic missiles today cost two million
dollars apiece and up. In spite of a recent news- 
paper release on the production rate attained by 
Douglas on the Thor IRBM, I think it is fairly ob- 
vious that the aggregate of all missiles being pro- 
duced today falls far short, for instance, of the 
25,000-a-day production rate attained by the auto- 
mobile industry. 

The problem of assessing the attained reliability in a complex weapons system is at least as complex as, if not more complex than, the weapons system itself. Most of us are familiar with the
definition of reliability published by the Depart- 
ment of Defense Advisory Group on Reliability of 
Electronic Equipment, which states, ‘‘Reliability is the probability that a device, system or equipment will perform satisfactorily for a specified period of time under specified conditions of operation.’’ Translated into the kind of reliability a missile man understands, the definition would read, ‘‘Reliability means that the missile is ready to go, that it checks out the first time during count-down, that it is launched precisely on time, that it travels the prescribed course, reaches the target zone and detonates at the prescribed altitude with the prescribed accuracy.’’ Consider just the problem of defining a successful missile flight. For surface-to-air missiles, for example, there are at least four different definitions of a successful flight, which depend upon the answer to such questions as: ‘‘Is a flight successful or unsuccessful when the missile has deviated from the programmed flight after the missile has passed the target?’’ ‘‘Is a flight successful or unsuccessful when the warhead explodes within the prescribed distance from the target drone but the drone isn’t killed?’’ ‘‘Is the flight successful or unsuccessful if the drone is knocked down even after a highly erratic midcourse flight?’’

There are other problems associated with reliability assessment of missiles besides arriving at a mutually agreeable definition of a successful flight.

Missiles are designed to be stored, serviced, handled, checked out and flown by military personnel. Although the contractor usually prepares the instructions governing these many operations, and in many cases provides the ground handling, checkout, storage and launching equipments, it is still axiomatic that the environment in which the missile lives after it leaves the factory is not under the contractor’s control. It can be argued that it is the development contractor’s responsibility to so design his product that it can successfully withstand normal service environment, but when an exchange of dollars hangs on the performance of a product which has been in the customer’s hands for a year or more, it is fairly obvious that a contractual debate will ensue on the reason for every failure.

Another problem stems from the normal contractual arrangements for the supply of the many items which make up a weapons system. It is rare, even under the ‘‘weapons system manager’’ concept, that a single prime contractor has total responsibility for the design and supply of all the equipment in an entire system. Sometimes there is a separate prime for power plants and for checkout equipment; usually there is one for fire control equipment; occasionally even different parts of the missile itself are developed and/or produced by separate prime contractors. Under these conditions, who is responsible for saying (indeed, who can say) that a flight would have been successful except for manufacturer A’s product? Even in those programs where reliability figures are presently released only for information, there is constant bickering among the various contractor and military elements of the procurement, logistics and operating team as to ‘‘who shot John’’ for the relative levels of reliability or for the assignment to the various elements of the causes of unreliability.

SMILEY: CONTRACTUAL ASPECTS OF RELIABILITY 85

Considering these and other difficulties, con- 
tracting for reliability along the classic lines, 
where definition of the requirements and assess- 
ment of the delivered item is the basis for pay- 
ment, is hardly a practical approach to our prob- 
lem of attaining reliability through contractual 
action, at least at the level of government prime 
contracting. When we can afford it, I think every- 
thing from bits and pieces to minor assemblies 
can be purchased to the stated premise of the Re- 
liability Monitoring Program. 

Materials at these low levels are characterized 
by limited numbers of attributes and therefore 
have limited modes of failures. The subcontracts 
for these minor assemblies can thus require that 
the designs or material meet specified reliability 
goals, and the assessment of the attainment of 
those goals can normally be made under previous- 
ly agreed-upon laboratory conditions. Unit costs 
of items at this level are low enough that we can 
afford to buy an extra 50 or 100 relays, or valves, 
or even gyros, to make an assessment within con- 
fidence limits that will permit the payment of in- 
centive reward or the assessment of nonachieve- 
ment penalties. I stress minor assemblies as the 
most complex to be tested because I wonder how 
many inertial guidance capsules, worth roughly 
half a million dollars apiece, we can afford to test 
in laboratories; or how many two hundred thou- 
sand dollar ballistic missile solid rocket motors; 
or, for that matter, how many thirty-five thousand 
dollar surface-to-air solid rocket boosters. And 
unless we can afford to test many, the confidence 
limits are as wide as our CEP. 
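The point about test quantity can be made concrete with a standard textbook result that the paper itself does not use. If n units are tested and none fails, the one-sided lower confidence bound on reliability at confidence C is $R_L = (1 - C)^{1/n}$. The sketch below assumes independent pass/fail trials with a common success probability. The helper name and the 90 per cent confidence level are illustrative choices.

```python
def zero_failure_lower_bound(n_tested, confidence):
    """Lower confidence bound on reliability after n_tested trials with no failures.

    Solves R**n_tested = 1 - confidence (the binomial probability of observing
    zero failures) for R.
    """
    return (1 - confidence) ** (1 / n_tested)


for n in (3, 10, 50, 100):
    print(n, round(zero_failure_lower_bound(n, 0.90), 3))
```

A handful of expensive articles supports only a weak lower bound, while 50 or 100 inexpensive relays support a much tighter one, which is the distinction drawn in the text.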

In brief, then, these are the basic problems 
connected with specifying in prime missile con- 
tracts that a weapons system, or a weapon, must 
have a certain reliability: 


1) The difficulty of arriving at the definition of 
the attained reliability. 


2) The exorbitant cost of making enough tests 
to assess the degree of attainment. 


3) The difficulty of accurately assigning the attained unreliability among the elements of the system and of the organization.


It might be inferred from the above that the 
Government can do little to attain the degree of 
reliability which is required of our present-day 
complex weapons system. Such is not the case. 
Furthermore, it can be done through the contractual medium. But the approach is one of engineering instead of one of statistics, a strategic approach instead of a logistic one. It is the approach that is currently being used in the Polaris program.

It is basic that to attain any goal in a procure- 
ment program—economy, high production, sched- 
ules, high quality, or a producible design—manage- 
ment attention must be focused on the attainment 
of the goal. Obtaining and continuing such manage- 
ment attention is the basic premise underlying 
today’s ballistic missile reliability program. Al- 
though there is still much to learn about how best 
to achieve that attention, we are rapidly moving 
toward an era where the contractor’s reliability 
system and organization will be as well defined 
and contractually required as his quality control 
system and organization are today. 

We need reliability today on the weapons we are 
developing and building today, and we cannot afford 
to wait until reliability prediction and assessment 
techniques mature to attain that reliability. Our 
experience in early missile programs has taught 
us dramatically and forcefully which elements of a 
reliability program are most needed to attain re- 
liability. 

Briefly and in their approximate order of importance, the principal elements are:


1) A complete design disclosure. It may be 
surprising that this element is considered to be 
the most important contribution to reliability. It
has, however, been our unfortunate experience 
that the value attained from the other elements is 
lost unless the design disclosure is thorough and 
complete. It does us little good to analyze thor- 
oughly and test a particular relay, to assess and 
improve its reliability, only to have the vendor 
change his process or his design without our 
knowledge. We, therefore, place the highest em- 
phasis on the completeness of drawings, specifi- 
cations, factory test procedures, field service, 
test and handling manuals and the other parts of 
the design disclosure package. Speaking for the 
Bureau of Ordnance, this package is the means by 
which the Government attains and retains control 
of the product, and it is through the medium of the 
design disclosure package that each level of con- 
tractor can control the product he receives from 
the next lower level. 

2) The existence of an element of the contrac- 
tor’s organization, whose primary function is re- 
liability, whose stature is adequate to insure that 
reliability is heard within the organization, and 
whose budget and personnel ceiling is large enough 
to accomplish the required tasks. Just as the Government in the past has delineated a contractor’s quality control organization to attain the required level of quality, so we must now delineate a contractor’s reliability organization to attain the required level of reliability.

3) A program of reliability testing. The high cost of development for complex weapons makes it imperative that reliability groups integrate into normal development testing as many of the reliability requirements as possible. However, since time is a parameter of more importance to the reliability engineer than to the development engineer, it is also necessary that there be additional testing to destruction, testing in overenvironments, and overtesting in environments which are deliberately chosen to produce failures, in order that we may assess the time-to-failure aspect. It is in this element that we expect to lay the greatest emphasis on design of experiments, to insure that we get maximum data with the minimum expenditure of precious, costly hardware.

4) A separate continuous detailed review of 
basic designs for reliability. In this day of spe- 
cialization, one of the anomalies which we fre- 
quently face (and one which I am most at a loss to 
understand) is that in which development person- 
nel resist a separate organization to review de- 
signs for reliability. ‘‘Reliability is everybody’s 
business’’ goes the axiom, and its proponents 
suggest that we can therefore infer that a design 
coming from a designer whose business is also 
reliability must be reliable. Unfortunately, sad
experience has indicated that what is everybody’s 
business is nobody’s business, and the job doesn’t 
get done. Just as we expect engineering personnel 
whose business is production to review designs 
for producibility, so we must expect engineers fa- 
miliar with the basic principles of reliability, and 
having reliability as their prime goal, to review 
designs for reliability. 

5) A feedback system, usually called a Failure and Trouble Report System, coupled with an adequate corrective action system which will ensure that our experience in the field will result in changes to existing and forthcoming hardware and design.

6) An adequate contractor’s and Government 
quality control organization and system to insure 
that hardware is produced in accordance with the 
design requirement. This is not a new element, 
but it does little good to develop a reliable design 
and document it so that its requirements are well 
defined, and then fail to insure that the hardware 
produced actually meets all those requirements. 


At least two of the six elements I have just listed (the requirements for design disclosure and the requirement for quality control organization and system) have been in Bureau of Ordnance contracts for many years. The requirements for the other elements are finding their way into newer contracts, and we are slowly learning to define in more exact terms that for which we wish to contract. The Bureau of Ordnance hopes to have within a few months a definitive document specifying these additional elements, which we will include as part of the contract requirements. We firmly believe that with proper system and organization, backed up with planning, funds and facilities, it is entirely practical to attain with today's techniques the reliability we need today for today's weapons.


RELIABILITY PREDICTIONS, A CASE HISTORY*


R. A. DAVIS† AND W. WAHRHAFTIG‡


The process of predicting reliability is usually treated like the weather; but we found that due to contractual requirements we had to do something about it. A description is presented here of the methods used in predicting the reliability of a piece of complex electronic equipment and an evaluation of the results based on field failure data.

The study to be described was performed mainly by the reliability engineers of Philco's Western Development Laboratory with an assist from Lockheed's XA Weapons System Reliability Engineering Department. The data used for prediction were supplied through the courtesy of RCA at Cape Canaveral, Fla. The data used in evaluation were taken from Trouble and Failure Reports used in the field by operating personnel.

One of the major pieces of ground equipment of the early part of the XA Weapons System is the ‘‘Verlort’’ Radar. This is a modification of the Mod II radar used at AFMTC. It in turn is a modification of the 584 radar. In order to evaluate the reliability of the Weapons System, it is necessary to study its components. This paper shows how the very long-range tracking radar, the Verlort, was studied.

For prediction, the approach taken was first to evaluate the existing Mod II radar and extrapolate the results to the new system. To make this evaluation, the operating logs of the Mod II radars were examined. It was found that about half of these had sufficient data to make statements about mean time to failure and mean time to repair.


*Presented at First Annual Bay Area Reliability Seminar, Menlo Park, Calif., February 19, 1959.
†Western Dev. Lab., Philco Corp., Palo Alto, Calif.
‡Missiles and Space Div., Lockheed Aircraft Corp., Sunnyvale, Calif.


It should be stated at this time that in order to 
predict reliability using any model, the important 
statistics to be determined are mean time to fail- 
ure and mean time to repair. 

The data gave:

Total operating time = 8388 hours
Number of failures = 343

This yields a mean time between failures, T, of 24.4 hours. Mean time to repair, t, was computed by averaging the individual times to repair, and was found to be 1.85 hours.

Fig. 1. Normalized time-to-failure frequencies: probability that time to failure exceeds T versus time T (hours). Curves: Mod II radar, theoretical; Mod II field failure data with 90 per cent confidence intervals; long-range radar, estimated.

Fig. 1 shows time to failure plotted with a 90 per cent confidence interval for each point. This also demonstrates how well the data fit the frequently assumed exponential model for time to failure. In Fig. 2, time to repair is similarly plotted. It should be noted that this, too, follows the exponential model.
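As a minimal sketch, assuming exponentially distributed times as the text reports for the Mod II data, the two statistics and the curves of Figs. 1 and 2 can be computed as follows. The function names are illustrative and do not come from the paper. Note that 8388/343 is about 24.45 hours; the text quotes 24.4 hours.

```python
import math

# Mod II field data as quoted in the text.
TOTAL_OPERATING_HOURS = 8388
NUMBER_OF_FAILURES = 343

def mean_time_between_failures(total_hours: float, failures: int) -> float:
    """Point estimate of MTBF: accumulated operating time divided by failure count."""
    if failures <= 0:
        raise ValueError("need at least one failure")
    return total_hours / failures

def prob_exceeds(x_hours: float, mean_hours: float) -> float:
    """Exponential model: probability that a time (to failure or to repair)
    exceeds x_hours, given the mean of that time."""
    return math.exp(-x_hours / mean_hours)

T = mean_time_between_failures(TOTAL_OPERATING_HOURS, NUMBER_OF_FAILURES)
t = 1.85  # mean time to repair in hours, as reported in the text

print(f"MTBF T = {T:.2f} h")  # about 24.45 h; the text quotes 24.4 h
print(f"P(time to failure > 10 h) = {prob_exceeds(10, T):.3f}")
print(f"P(time to repair > 2 h) = {prob_exceeds(2, t):.3f}")
```

Plotting `prob_exceeds` against time for each mean gives the straight "theoretical" lines one would expect on semi-log axes, against which field data points can be compared.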


Fig. 2. Normalized time-to-repair frequencies: probability that time to repair exceeds T versus time T (hours). Curves: theoretical; field failure data.


To evaluate the reliability of the new radar it 
was necessary to determine how it differed from 
the existing model. The Mod II has 60 major com- 
ponents. The Verlort has 88. These 88 fall into 
three categories: existing, the same as used in 
the Mod II; modified, similar to those used but 
with varying degrees of change; and new, not used 
in the Mod II. 

The most direct method for extrapolating the 
field failure data of the AFMTC radars to the very 
long-range system is to assume that the 88 units 
of the Verlort system have the same average fail- 
ure rate per component as that found on the 60 
units of the Mod II. 

Such an assumption appears to be valid. In ex- 
amining circuits of both the old and the new com- 
ponents, the application of parts, the circuit con- 
figuration, and packaging employed were found to 
be similar (some circuits were actually analyzed 
using the stress analysis techniques described by 
RCA). The functions to be performed by the new 
circuits were not radically different from the 
original circuitry, and so no attempt to push the
‘‘state of the art’’ had to be reconciled. The circuits
that were modified were for the most part
changed to improve reliability. 


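With the Mod II failure rate equal to 1/T = 1/24.4 = 0.04098 per hour:

$$\text{Average component failure rate} = \frac{\text{Mod II failure rate}}{60} = \frac{0.04098}{60} = 0.000683/\text{hour}$$

$$\text{Long-range radar failure rate} = 0.000683 \times 88 = 0.06010/\text{hour}$$

$$\text{Long-range radar mean time to failure} = \frac{1}{0.0601} = 16.6\ \text{hours}$$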
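The first extrapolation can be sketched in a few lines. This is an illustrative reconstruction of the arithmetic above, assuming (as the text does) that every component has the same average failure rate; `scaled_mtbf` is a hypothetical helper name.

```python
# Method 1 from the text: assume each of the 88 Verlort components fails at the
# same average rate as each of the 60 Mod II components.
MOD_II_MTBF_HOURS = 24.4   # from the Mod II operating logs
MOD_II_COMPONENTS = 60
VERLORT_COMPONENTS = 88

def scaled_mtbf(base_mtbf: float, base_count: int, new_count: int) -> float:
    """Scale a system failure rate by component count, then invert to get MTBF."""
    per_component_rate = (1.0 / base_mtbf) / base_count  # failures/hour/component
    new_system_rate = per_component_rate * new_count     # failures/hour
    return 1.0 / new_system_rate

mtbf = scaled_mtbf(MOD_II_MTBF_HOURS, MOD_II_COMPONENTS, VERLORT_COMPONENTS)
print(f"Estimated Verlort MTBF: {mtbf:.1f} hours")
```

Algebraically the result is simply 24.4 × 60/88, so this method amounts to scaling the old MTBF by the ratio of component counts.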


As a check, another method was utilized in 
making the extrapolation. A component layout 
drawing was obtained from Reeves Instrument 
Company. It showed the block layout of the very 
long-range radar and general information regard- 
ing the parts within each component in the block 
diagram. Failure rates were then assigned to the 
Verlort by dividing the 88 components shown on 
the Reeves drawing into the three categories: existing, modified, and new. The estimated mean times between failures were computed by comparing the complexities of the existing components, in terms of their failure rates, with the knowledge available about the components being modified and the new components. Summation of failure rates for the long-range radar yields the system mean time between failures shown below.

The failure rates are a summation of the reciprocals of the estimated mean times between failures.


                                         Number    Failure Rate (per hour)
Unchanged components                       46        0.01819651
   (25 used once; 5 used more than once)
Modified components                        20        0.02168838
New components                             22        0.02318780
TOTALS                                     88        0.06307269

$$\text{Mean time between failures} = \frac{1}{0.06307269} = 15.85\ \text{hours}$$
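The second method reduces to adding failure rates and inverting the total. A minimal sketch, assuming a series system with constant failure rates (any component failure is a system failure), which is the usual justification for summing rates; the dictionary labels are illustrative.

```python
# Method 2 from the text: each category rate is already a sum of reciprocals of
# estimated component MTBFs; add the category rates and invert the total.
category_rates_per_hour = {
    "unchanged (46)": 0.01819651,
    "modified (20)": 0.02168838,
    "new (22)": 0.02318780,
}

def system_mtbf(rates) -> float:
    """Series-system assumption: failure rates add, and the system MTBF is the
    reciprocal of the summed rate."""
    total_rate = sum(rates)
    return 1.0 / total_rate

total = sum(category_rates_per_hour.values())
print(f"Total failure rate: {total:.8f} per hour")
print(f"System MTBF: {system_mtbf(category_rates_per_hour.values()):.2f} hours")
```

The same `system_mtbf` function would accept individual component rates (1/MTBF for each of the 88 components) directly, since summing within categories and then across them gives the same total.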


The two methods above, yielding mean times between failures of 15.85 and 16.6 hours, indicate that 16.0 hours is a good approximation to use.

Both procedures required the assumptions that, for both systems, the population is composed of items of similar construction with similar modes of failure, and that over a long period of time the MTBF of such complex systems approaches an asymptotic value. The model assumed for the time to failure is not important. Although our data, as indicated in Fig. 1, fitted the exponential model, the same assumption of homogeneity would apply to any other model. Fig. 3 shows the percentage part replacements for both the Mod II and the long-range radar, and supports the assumptions.


Fig. 3. Relative frequency of part replacements for the Mod II (solid bars) and the long-range radar, by part type: electron tubes, resistors, relays, Trans Pac (not used in the Mod II), transistors, switches, transformers, fuses, indicators, plugs, and miscellaneous.


As for mean time to repair, this is mainly a 
function of troubleshooting procedures and per- 
sonnel training. Since we anticipated these to be 
about the same as those used at AFMTC, we as- 
sumed the same mean time to repair—1.85 hours. 

Let us now see how good these estimates were. One hundred and thirty-four Trouble and Failure Reports from operating Verlort radars were analyzed. These reported 56 ‘‘operating failures’’ in 1140 hours of operating time. This yields a mean time to failure of 20.4 hours. The 90 per cent confidence limits on this place the estimate between 16.6 hours and 25.8 hours. This mean time to failure is represented by the second curve on Fig. 1.¹ The mean time to repair was found to be 1.3 hours. It can be seen that the estimates are quite close to the values found in actual operation. (If all the failures reported from the field had been included, T would have been about 13.9 hours, with limits of 11.7 and 16.9 hours, as shown on the lower curve of Fig. 1.)
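The 90 per cent limits quoted above can be approximately reproduced with the standard chi-square interval for an exponential mean, $2T_{op}/\chi^2_{1-\alpha/2}(2r) \le \theta \le 2T_{op}/\chi^2_{\alpha/2}(2r)$, where $r$ is the number of failures and $T_{op}$ the total operating time. This is an assumption about the method, not the authors' stated procedure. The sketch uses 2r degrees of freedom at both ends and the Wilson-Hilferty approximation to chi-square quantiles so that only the Python standard library is needed.

```python
from statistics import NormalDist

def chi2_quantile(p: float, dof: int) -> float:
    """Wilson-Hilferty approximation to the chi-square quantile (good for large dof)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3

def mtbf_confidence_interval(total_hours: float, failures: int, confidence: float = 0.90):
    """Two-sided interval for an exponential mean, assuming 2r degrees of
    freedom at both ends (one common textbook form)."""
    alpha = 1.0 - confidence
    dof = 2 * failures
    lower = 2.0 * total_hours / chi2_quantile(1.0 - alpha / 2.0, dof)
    upper = 2.0 * total_hours / chi2_quantile(alpha / 2.0, dof)
    return lower, upper

point = 1140 / 56
lo, hi = mtbf_confidence_interval(1140, 56)
print(f"MTBF = {point:.1f} h, 90% interval {lo:.1f} to {hi:.1f} h")
```

For 56 failures in 1140 hours this gives roughly 16.6 to 25.7 hours, close to the quoted 16.6 and 25.8; the small difference in the upper limit may reflect the tables or rounding used.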

Two points should be mentioned briefly. First, a great deal of benefit is obtained from gathering data of this sort besides its use in determining reliability statistics. From both the Mod II logs and the Trouble and Failure Reports, valuable information was obtained for pinpointing trouble areas, for establishing a preventive maintenance program, and for help in determining spares.

¹The confidence interval is based on the assumption of an exponential distribution, which the Mod II experience indicates.

Second, it should be noted that nowhere above was reliability calculated. Reliability (or, for equipment of this type, availability) is a complex function of the operating requirements. As a simple example to illustrate a difference between the two radars, the following model might be used. It makes these assumptions:


1) The system is either in use or being re- 
paired. 


2) If the system is in a failed state at the out- 
set of its use period or fails during the pe- 
riod, it will not be repaired before the end 
of the period. 


3) The time at which the system is required is 
independent of whether it will operate. 


4) The operating time, T, is 16 minutes.


In this case, reliability, R, is given by: 


For the Mod II, R = 0.91; for the Verlort, R = 0.88.
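The equation for R did not survive extraction. One standard form consistent with assumptions 1) through 4), offered here as an assumption rather than the authors' formula, multiplies the probability that the system is up when called, T/(T + t), by the probability that it survives the use period without failure, exp(-τ/T), where τ is the 16-minute operating time (written T in item 4). A minimal sketch:

```python
import math

def mission_reliability(mtbf: float, mttr: float, operating_hours: float) -> float:
    """Probability the system is up when called and survives the use period,
    assuming exponential failure times and steady-state availability."""
    availability = mtbf / (mtbf + mttr)           # P(up at a random demand time)
    survival = math.exp(-operating_hours / mtbf)  # P(no failure during the period)
    return availability * survival

USE_PERIOD_HOURS = 16 / 60  # 16 minutes

r_mod2 = mission_reliability(24.4, 1.85, USE_PERIOD_HOURS)
r_verlort = mission_reliability(16.0, 1.85, USE_PERIOD_HOURS)  # predicted values
print(f"Mod II: {r_mod2:.3f}, Verlort (predicted): {r_verlort:.3f}")
```

With the Mod II values (T = 24.4 h, t = 1.85 h) this gives about 0.919, and with the predicted Verlort values (T = 16.0 h, t = 1.85 h) about 0.882, near the quoted 0.91 and 0.88.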
The foregoing illustrates that for large sys- 
tems we are able to predict, with a fair degree of 
accuracy, those statistics required to evaluate the
equipment from a reliability viewpoint. In this
case, we were fortunate to have available a great 
deal of data from which to start. However, in 
most cases, sufficient data exist on similar equip- 
ment and/or similar operating requirements for 
adequate predictions to be made. 


REFERENCES

[1] ‘‘Verlort Radar: Field Failure Analysis,’’ attached to Philco memo from Quality Assurance Group to Distribution on Analyses of ‘‘Trouble and Failure Reports’’ for Verlort Radar, Western Dev. Labs., Philco Corp., Palo Alto, Calif.; January 22, 1959.

[2] ‘‘Reliability of Long Range Tracking Radar,’’ Missiles Systems Div., Lockheed Aircraft Corp., Rept. No. LMSD-6152; September 5, 1958.

[3] D. Bentley, ‘‘Determination of Availability,’’ Western Dev. Labs., Philco Corp., Palo Alto, Calif., Tech. Rept. No. 1084; August, 1958.

[4] ‘‘Reliability Stress Analysis for Electronic Equipment,’’ RCA, Cape Canaveral, Fla., Tech. Rept. No. 1100; November 28, 1956.

[5] Shell and Saar, ‘‘Mod II Radar: Preventive Maintenance Program,’’ 2nd ed., Installation and Maintenance Range Operations, RCA Missile Test Project, PAFB, Fla.; April, 1957.

[6] ‘‘AFMTC Tracking Radar Mod I, Vol. II: Theory of Operation, Trouble Shooting and Repair,’’ Reeves Instrument Corp. (Confidential).

[7] ‘‘Handbook of Preferred Circuits,’’ Navy Aeronautical Electronic Equipment (NavAer 16-1-519), prepared by National Bureau of Commerce for Bureau of Aeronautics, Dept. of Navy; September, 1955.

[8] Pearson and Hartley, ‘‘Biometrika Tables for Statisticians,’’ vol. 1, Table 40; 1956.

[9] H. Cramér, ‘‘Mathematical Methods of Statistics,’’ Princeton Univ. Press, Princeton, N. J.; 1945.

[10] N. S. Hawley and J. H. Rowland, ‘‘A Mathematical Model for Reliability,’’ Missiles System Div., Lockheed Aircraft Corp., Sunnyvale, Calif., Rept. No. LMSD-2278; September, 1957.


CONTRACTOR MANAGEMENT LOOKS AT RELIABILITY
PROGRAM ACTIVITIES*


W. B. LABERGE†


Most of us who are here today live several 
quite different lives. For a certain portion of 
each day and of each week, we engage in a busi- 
ness conducted in a business environment with 
its specialized set of objectives. Quite separate 
from this, in the remainder of our lives, we are 
engaged in social activities and a pursuance of 
objectives set by our social standards, perhaps 
towards objectives quite different from those of 
our business life. Through periods of our ado- 
lescence, educational development, and through- 
out our adult life, we have modified the way by 
which we live in this social environment, adapted 
to it, and by adapting to it are permitted achieve- 
ment of our objectives. Within this common 
social climate, we are quite able to have indi- 
vidual sets of objectives and individual moral 
and ethical codes. Despite these individual differences, however, we must admit to being in a common society. This common society, although it has a reasonably diverse set of component parts, is still regulated to achieve the common good.

Although we do not, in detail, always know the 
precise direction in which we should go, or the 
detailed actions which we should take, to achieve 
these goals, we do not here need a lecture about
the social environment in which we live. We 
have lived in that environment and we have be- 
come able to know and understand it. 


*This paper was presented at the First Annual 
Bay Area Reliability Seminar, Menlo Park, Calif. 
February 19, 1959. 

†Western Dev. Lab., Philco Corp., Palo Alto, Calif.


What I would like to do with you for a little 
while is to present a discussion of the business 
environment in which you live, which perhaps is 
not nearly as well known to you as your social 
environment, and discuss with you a few of the 
constraints and restrictions and opportunities 
which form a part of this environment. As en- 
gineers engaged in reliability activities, you 
form part of a quite new area of engineering
activity. 

This activity is deposited in an existing busi- 
ness society whose standards have been un- 
changed for many years. It is as if you were the 
equivalent of a minority group placed in a staid 
suburban area which of its own would not choose 
to have such an addition. If you accept this 
equivalence to a minority group, it must then be 
recognized that it is your responsibility to overt- 
ly strive to fit into the business environment. 

For without your efforts to integrate into the 
work towards the common good, this business 
society will do what its social counterpart would
do: isolate you and wall you off from it.

The reason for this reaction to a new group
is the same kind of fear which underlies the 
treatment of social minorities, namely a lack of 
understanding of a minority’s function and a 
worry that it will take over or retard something 
which is not theirs. 

Therefore, let me make a first point: a positive effort is required by a reliability group to assure all members of the business society that reliability groups can and will effectively work toward the common business goal in a harmonious, integrated way. This is quite crucial. It is a requirement of each man in a reliability program as well as the requirement of its most senior people. If this effort is not made, you will be isolated and walled up by some mechanism or other. By a series of organizations and reorganizations, you can become further and further detached, physically and organizationally, from what is really the interest of the business of which you are a part.

If one can achieve social acceptance now, then one can turn to the next most important question, that of organization and responsibility assignment. One realizes that there has always been a desire and, in fact, a requirement for high quality in equipment produced by the manufacturer. There has always been some mechanism established within the business environment to monitor, supervise, and encourage reliability of operation of equipment. What is new and different about reliability activities has been caused by the tremendous growth in the technological requirements of equipments. What had before been relatively straightforward mechanical arrangements of parts now have become very complex assemblies of very complex individual parts. Furthermore, after grouping of these parts into assemblies, these assemblies have been further grouped into very major collections of assemblies called ‘‘systems.’’ The net result is that the individual contracted item is much more immense than it used to be in its total assemblage of parts. No longer does a straightforward, workmanlike job of assembling parts permit assurance that the major system resulting from these parts is satisfactory. This has led to a business emphasis on system design, and with it an emphasis on system reliability, and through it an emphasis on individual component part reliability.

If one admits reliability programs have changed, we can look at how management reacts to this change. First, let me say that it is obvious to you and obvious to any competent management activity that the engineering force, both in its design and in its reliability facets, is the backbone of an R and D organization. Without a reliable product, there is no reason for existence of a research and development or an engineering activity. However, these are not the only parts, or the only important parts, of a business organization. Although these other parts do not necessarily require engineering education or skills in their proper execution, they do require a high level of competence and have a very important impact on the business itself. The combined parts of this business strive to execute their moral responsibility. That moral responsibility which any properly conducted R and D activity has is to show a reasonable financial profit and to place that organization, by expansion of its facilities and capabilities, in such a position as to ensure its continued growth and ability to provide products and services purchasable by its customers. It needs to be clearly recognized and clearly understood that not only is it the way things happen in a business society, but that it is the moral and ethical responsibility of the management of any organization to show a profit for those who have invested capital in that enterprise. Each of you who invest in other enterprises expects to see these enterprises grow and prosper and a reasonable return to be made upon the investment which you have made. So also the business in which you are now working is required by the same moral and ethical responsibility to provide a profit to its supporters.

What this means, therefore, is that the maxi- 
mum economy of operation must be exercised in 
order that one can most straightforwardly and 
most economically pursue the objectives of this 
business enterprise. Surely the engineering staff 
does not wish to consider itself as accountants or 
controllers or plant facility people but similarly 
the plant facility personnel, accountants and the 
controllers frequently do not wish to be considered engineers. Each is a separate and integral portion of the business which is being conducted.
The function of a business management is to eval- 
uate the individual contribution of each of the in- 
tegral parts of the business operation and to pro- 
vide a management structure which places them 
in a proper line position to ensure correct em- 
phasis on these individual parts. 

With respect to reliability, proper organization
is perhaps one of the most challenging problems 
which a management can have. Within each of the 
major structural elements of a business organi- 
zation, the influence of reliability activities is felt. 
There is no clean-cut separation of reliability in 
the engineering, plant operations or the fiscal or 
the production areas. An example of this is per- 
haps the obvious one of a quality control program. 
It must exert its influence not only on the engineering development of a product, but also on the methods for large-scale production. It must exert its influence upon the purchasing side of the house by
proper selection of component parts vendors, and 
also it inevitably must affect the controller’s office 
through the cost that the business must bear to 
support quality control. 

Before one speaks much further about the man- 
agement problems associated with reliability, it 
is necessary to define what management wishes 
from a reliability program. Perhaps within the 
context of this paper, these objectives can simply be stated in two parts. First, reliability programs must participate in the line responsibility
of an engineering department charged with the 
development and fabrication of a product for a 
customer. Secondly, reliability programs must 
provide an independent audit to management, re- 
porting the prognosis of success of a given pro- 
gram during the early course of its engineering 
development. 

If one admits that the two requirements exist 
for reliability activities, that of line and audit, then 
reliability must be organized both in a line capac- 
ity and in a staff capacity. These two obviously 
incompatible simultaneous requirements present 
a problem to management. They also present a 
problem to you as a member of a reliability pro- 
gram, for your personal success and that of your 
group are measured by your contribution. This 
contribution is frequently significantly modified 
by the organizational structure within which re- 
liability works. 

Therefore, a second point I bring you is that as 
members of a reliability program, you must en- 
sure by discussion with the business management 
that your organizational position within the com- 
pany structure permits achievement of the aims 
of the reliability program. 

The organization of reliability programs in one 
company can differ from the organization in another, since company requirements differ from one another in many ways. A claim cannot be made
that one way of organization is manifestly better 
than another. Nor does one claim that an organi- 
zation must be permanent. In fact, it must change 
as the conditions defining it change. However, as 
in your social world, so also in your business 
world, before you can act you must be in a posi- 
tion to act. 

If the preceding recommendations were out- 
ward in their relations to others, the next few will 
tend to be more inward looking. Reliability pro- 
grams must recognize that management views 
their activities in a way not really very different 
from the way it views the activities of the account- 
ing, production and purchasing departments. The 
overriding questions always asked are: Is the 
service necessary? If so, what size service is 
required? Is the service provided competently 
performed? Is it worth the cost it sustains? 
These are essential management questions and 
are inevitably asked. 

Therefore, one may make the next points di- 
rectly. Reliability programs must directly ad- 
dress themselves to the problems of the business 
concern for which they work. Perhaps this is ob- 
vious to you, but it is unfortunate how frequently 
this point is ignored. To reiterate, the normal reliability responsibility is to work on the problem assigned to it.

Far too many times, Parkinson's law operating on otherwise intelligent engineers creates a wholly closed intellectual world where masses of paperwork are distributed to those within that world but out of which no useful output appears. A second frequently seen occurrence is that of the reliability group which has become so erudite that no one can understand it. As it becomes so knowledgeable on the details of its subject, it ceases to care whether anyone of dissimilar intellectual background can understand it. Finally, it feeds only on the praise it internally generates for itself. Statistics for statistics' sake has a place, as does any basic research program, but if you accept your check on the basis of work on a specific problem, then output is the major measure of your worth in this society you joined.

Immediately after groping to see that there is 
an output, a management must try and ascertain 
if it is a competent output. It is surely so that 
quite some number of management leaders are 
themselves not competent to judge. However, 
they must so judge. This is not unique to relia- 
bility; each major function of a business is like- 
wise judged on the basis of somewhat inexact 
standards. One of these inexact standards is 
one’s view of the competency of the people in- 
volved. To be competent as a group, one must 
employ competent people. A reliability depart- 
ment or an engineering department may obtain 
excellent people or it may acquire poor ones. 
Good people tend to attract good people, and poor people tend to attract poor people. So, progressively, the competence of a group increases or decreases but seldom stays static. Inevitably, on a day-by-day basis, management views the quality of a program by the quality of the people it sees in that
program. Therefore, a next major point to be 
made is that ‘‘a Reliability group must be com- 
petent and appear competent to survive.’’ 

So far, in order, management recommenda- 
tions to reliability activities might be to: 


1) make themselves a part of the business 
they are in; 

2) see to it that they are organized so as to 
permit effectiveness; 

3) work on the problem assigned to them; and

4) be competent and demonstrate competence. 


Lastly, the most important point, reliability 
must provide a useful output. In this business en- 
vironment of which one has spoken, it is the 
output which is sold. If the return by sale ex- 
ceeds the cost of the product or service, a profit 
is made. If not, a loss accrues. The long-term 
objective of a business must be to show a suitable 
profit. If a reliability program, by useful output, 
does not contribute a measure equivalent to its 
cost, then major remedial action is incumbent. 
Therefore, I place strong emphasis on this fifth and last point, which is, to reiterate, that a reliability program must above all have an acknowledged useful output.

By now, if you are still rational and not emo- 
tionally overwrought by the directness of this so- 
called management view, you will have noticed 
perhaps one obvious point. What has been said 
about reliability programs can be said in almost 
an identical manner about each and every other major
department of a business. This is as
it should be. You are a part of this business so- 
ciety, and if accepted by it, should be expected to 
be judged by that society’s standards. 

In conclusion, perhaps one final statement can 
be made. As the speaker gives management’s view, 
simultaneously you are given a view of management. In this business society of which one speaks, you must live with management as it must live with you. Each has a common aim; the interesting challenge is to see how each can help the other, so that
cooperatively they can achieve their common aim. 


DESIGN INFORMATION INTERCHANGE AMONG 
CO-CONTRACTORS* 


MARTIN BARBE†


INTRODUCTION 


My paper will describe some techniques to fa- 
cilitate intercontractor reliability data exchange, 
and will suggest procedures to increase the util- 
ization of such data after receipt. It will also 
outline the normal activities of the Space Tech- 
nology Laboratories and the Ballistic Missile 
Division of the Air Force, in these general areas. 

I would like to establish the boundary limits of 
my discussion by identifying it as dealing with the 
flow of information between contractors, rather 
than the flow of instruction from a prime con- 
tractor or technical direction agency, as the title 
in the program might imply. There are areas 
where the two functions overlap, especially where 
informative material is available to the prime 
contractor, and desired by the associate contrac- 
tors; and neither wishes it transmitted as con- 
tractual direction. However, my topic does not 
primarily deal with this informing or educating of 
subcontractors, although some of the procedures 
described can be used to facilitate this action. 

In late 1957, STL/BMD instituted a system for the interchange of nonproprietary data on the testing of component parts among the Titan contractors. Fig. 1 may clarify the agencies referred to.

*This paper was presented at the First Annual Bay Area Reliability Seminar, Menlo Park, Calif., February 19, 1959.

†Space Technology Labs., Ramo-Wooldridge Corp., Los Angeles 45, Calif.
Fig. 1. General functions of participants in the Titan Missile Program, as they tie in with the reliability activities described. (Chart: the Air Research and Development Command (ARDC), the Air Materiel Command, the Ballistic Missiles Division (AFBMD), the Strategic Air Command, and Space Technology Labs. (STL), a division of the R-W Corp., linked to the associate contractors through technical and contractual aspects; coordinating groups with no direct authority: the Contractors' Reliability Committee and the Quality Manager's Committee.)




In mid-1958 this was extended to cover the airframe contractors for Titan, Atlas and Thor, and an interchange was set up with NOL (Corona) and ABMA (Redstone), plus the recent initiation of a specialized interchange between Air Force propulsion contractors.

The interchange between Titan contractors in- 
cluded not only test reports, but also general re- 
liability procedural plans, reports on design tech- 
niques and manufacturing processes, procurement 
specifications, and educational bulletins and films 
on reliability. (The interchanges between differ- 
ent programs and between Services have been 
more restricted to component part test reports.) 
It is obvious that such general documents are of 
maximum usefulness near the inception of a pro- 
gram and their utility decreases as procedures 
are established. We have therefore concentrated 
on reports of laboratory tests, run under control- 
led and recorded conditions of load and environ- 
ment, on equipment which is completely identified 
and obtainable by other users. 

It is expected that the cross-program ex- 
changes by contractors on similar portions of 
the weapon system will prove far more produc- 
tive than the ‘‘single program’’ exchanges, even 
though smaller parts and GSE components may be 
common to several contractors on one program. 

Although the reports may not describe tests 
conducted under exactly the environments or to 
the limits on which positive information is de- 
sired, they can often point out modes of failure 
or performance to investigate, which allows a 
much shorter verification test. Such data can be 
of help to a contractor who has only a minor por- 
tion of his equipment of a type on which another 
contractor has been able to afford considerable 
specialization. 

Individual reports of field failures are not 
transmitted between contractors, although infor- 
mation on serious problem areas may be. 


MECHANICS AND MONITORING OF 
INTERCHANGE FLOW 


The interchange is initiated by a contractor-to-contractor exchange of lists of new reports, available on direct request. (See Fig. 2.) STL monitors this activity through information copies only, and gives such assistance as organizational meetings, issuance of consolidated listings, standard forms, etc.

In establishing this activity, STL has followed 
these steps: 


1) Obtain each contractor’s top-management approval for identifying a single individual as focal point, or “coordinator,” for interchanges, to correspond directly with similar coordinators at other contractors.

2) Promote cooperative action between co- 
ordinators by personal contact and direc- 
tion where required. 

3) Encourage monthly exchange of lists of ex- 
perimental reports available on request. 
(Requests would follow after examination of 
lists for titles of interest.) 

4) Encourage utilization of listings within the contractor’s plants through periodic STL/BMD issuance of subject-oriented consolidated listings of all current reports available to all contractors.

5) Monitor for prompt cooperation, through 
copies of all requests and transmittals. 


CONTRACTUAL ASPECTS 


The initial program was established by STL on a purely voluntary basis, and this noncontractual aspect was later emphasized by a BMD letter to contractor top management in the spring of 1958. Almost 100 reports per year have been volunteered by each participating contractor, so that, while contractual direction of the production and offering of these reports might be desirable in the future, it has not yet been imposed.

There are and have been many efforts made in 
this area of data exchange. I will describe some 
of the current or pertinent activities with which 
I am familiar. 


Electronic Industries Association Activity 


At present, any action on Qualification Data 
Interchange by the QA-1 Group is at a standstill. 
Likewise, the M-5 group of the EIA is no longer 
contemplating data exchange. (However, the sys- 
tem developed for environmental-level coding on 
punch cards may be of interest and information 
on this and on the choice of environments and in- 
tensities is available on request.) 


Aircraft Industries Association Guided Missile 
Committee 


The latest information received is that suggestions in the area of data interchange are receiving specific attention from the legal staffs of AIA members. At present the Aircraft Research & Test Committee, made up of the chiefs of the test labs, has a limited informal exchange of test reports on a personal basis.


1960 


BARBE: DESIGN INFORMATION INTERCHANGE AMONG CO-CONTRACTORS 


[Fig. 2 reproduces a transmittal form:]

TITAN PROGRAM
RELIABILITY INFORMATION INTERCHANGE
Date: 9-6-57    File No.: 47-16
[X] Transmittal    [ ] Request
Subject: Listing of Recent Test Data
References: none
Enclosures: none

Distribution: Aerojet, Azusa: B. Wilner; Aerojet, [illegible]: J. J. Peterson; [illegible]: E. Dertinger; Avco: J. Leary; [illegible]: T. Winternitz; Martin: A. Rhoads; [illegible]; R-W: Proj. Eng.

The following Environmental Laboratory Test Reports are available to all interested parties:

No.            Report
33             Barium Titanate Accelerometers - Gulton
[illegible]    Accelerometer Transmitters - Giannini
28             5 Volt Mercury Cell Battery - Mallory
55             30 Volt Primary Battery - Yardney
[illegible]    Self-amplifying Accelerometers - Gulton
201            Test Report of the Rapid Throwover Relay MS 25024-1 and MS-25035-1 Relay
19             Accelerometers - Statham
26             Slot-Antenna - Radiation Inc.
75             Battery Cells - Nicad
29             Battery - Saft Voltabloc
71             50 Hour Salt Spray of Beryllium Specimen
[illegible]    Life tests on 2N135, 2N136, 2N137, 2N123, 4JD1A17, 2N45, SB100
222            Effect of 2 kinds of flux on printed circuit boards
[illegible]    Effect of temperature, humidity, temperature shock and salt spray on Amphenol No. 111394 connectors
E-1-B          Silver-Zinc Battery, Comparison of AMF and Cook Company Battery
E-18.1         Rotary Frequency Converter, Functional Tests

UNLESS OTHERWISE NEGOTIATED IN ADVANCE, ANY ACTIONS GENERATED BY THIS TRANSMITTAL ARE CONSIDERED VOLUNTARY IMPLEMENTATION OF EXISTING CONTRACTUAL OBLIGATIONS TO SUPPLY THE U. S. GOVERNMENT WITH RELIABLE WEAPONS SYSTEM ELEMENTS, AND DO NOT OBLIGATE THE REQUESTOR FOR ANY EXPENDITURES INCURRED, […] ENDORSEMENT. BY SUCH ACTION, THE TRANSMITTER ASSUMES NO LIABILITY TO ANY PATENT OWNER NOR ANY RESPONSIBILITY OR OBLIGATION WHATSOEVER TO PARTIES ADOPTING ANY PRODUCTS OR PRACTICES NOR ANY RESPONSIBILITY OR LIABILITY FOR COMMENTS ABOUT ANY PRODUCTS OR PROCESSES. APPROVED PRODUCTS AND PRACTICES ARE THOSE DEEMED SATISFACTORY FOR THE PARTICULAR PURPOSES AND STANDARDS OF THE GUIDED MISSILE PROGRAM, AND NO ATTEMPT HAS BEEN MADE TO TEST OR EVALUATE ALL PRODUCTS AND PRACTICES WHICH MIGHT PROVE SATISFACTORY.

Fig. 2—List of lab test reports for exchange between contractors in Titan Missile Program.




Project 2, Task B of the Electronic Reliability 
Panel of AIA 


This project is now drafting plans to bring be- 
fore the parent committee, to follow on from the 
previous efforts of the AIA Reliability Sub-Com- 
mittee. Present thinking of the Project is to en- 
courage a series of ‘‘networks’’ for interchange 
among special interest groups, such as the bal- 
listic missile contractors; and later possibly to 
merge these into one. 


“HELPER” Program of Inland Testing Labs


The proposed program, involving large-scale tests under a wide variety of environments, conducted by a central test lab but financed by a number of contributing corporations, has now been abandoned.


Battelle Memorial Institute Program 


Battelle proposed a cooperative data analysis and summarization service to a number of large concerns in the fall of 1958, and now states that it has the required ten subscribers, at $20,000 each. This is an extensive activity involving review by Battelle engineers of contractors’ raw data at their plants, and the compiling of these data into periodic state-of-the-art reports on particular components. If successful as planned, it might eventually supersede the direct interchange of full test reports; but, for the present, its stage of development and longer time cycle seem to aim it at a somewhat different requirement.


Activity by the Office of the Assistant Secretary 
of Defense 


In the fall of 1958, Mr. E. J. Nucci instituted an Ad Hoc Study Group on Electronic Parts Specification, Management and Reliability, which included the topic of centralizing information on component performance. However, until the present their work has been largely preliminary, although they are very interested in assisting any activities mounted in this line by non-Government groups.

The exchange system described works suit- 
ably enough for up to about seven contractors, 
but, if expanded, the problems of communication 
from all contractors to all contractors become 
cumbersome. 
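The growth in burden that the author describes can be made concrete with simple counting. Assuming every contractor exchanges lists directly with every other contractor, the number of distinct contractor pairs grows as n(n-1)/2, whereas routing everything through one central disseminating agency (the plan discussed in the next paragraph) needs only n links. The sketch below is illustrative arithmetic only; the function names are hypothetical and nothing here reflects an actual Titan program model.

```python
def pairwise_links(n: int) -> int:
    """Distinct contractor-to-contractor channels when everyone exchanges with everyone."""
    return n * (n - 1) // 2


def hub_links(n: int) -> int:
    """Channels when each contractor exchanges only with one central agency."""
    return n


def monthly_lists_received(n: int) -> int:
    """Lists each coordinator must examine per month under direct exchange
    (one list from every other participating contractor)."""
    return n - 1


if __name__ == "__main__":
    for n in (3, 7, 15, 30):
        print(f"{n:>2} contractors: {pairwise_links(n):>3} pairwise links, "
              f"{hub_links(n):>2} hub links, "
              f"{monthly_lists_received(n):>2} lists/month per coordinator")
```

With seven contractors there are 21 pairs to keep in step; at 30 there would be 435, while a central agency would still need only 30 links.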

The Air Force is cooperating with ABMA and 
INSORD (Sunnyvale) in attempting to arrange an 
inter-Service exchange of data between all bal- 
listic missile contractors. One plan being dis- 
cussed involves a central disseminating agency, 
working with ASTIA, and utilizing microcards. 
If adopted and proven satisfactory, this should 
increase the availability and utility of the data 
to the end user. The procedure might also as- 
sist by relieving the technical people partici- 
pating in an interchange system of the burden of 
handling the routine transmittals, leaving their 
time for follow-on inquiries or topics not suit- 
able for a standardized transmittal. 

Specific contractual coverage of interchange (as now called out on one Ballistic Missile contract) might facilitate the activity, and this point is being considered on new contracts.

However, various developments in data exchange, such as punched-card coding of the experiment description, centralization of this activity on any national basis, and standardization of the format, summary, procedures, or even requirements of component test reporting, are all quite conditional on some positive demonstration that the information transmitted will actually be believed and demonstrably used to benefit the national defense effort.




HUMAN FACTORS IN THE ATTAINMENT OF RELIABILITY* 


R. S. LINCOLN†


Human factors influence the reliability of complex equipment in many ways. Such a statement should surprise no one, since human beings are highly involved in the design, fabrication, operation and maintenance of every machine that we have. Despite the general agreement on this statement, systematic attention to the human factors problem has only recently been extensively applied. In this discussion I would like to concentrate on two different (but related) aspects of the human factors problem and briefly describe some useful remedies. The distinction between the two aspects results from a concern for the operators of the equipment as contrasted to a concern for the design and reliability engineers who interact in designing the equipment.

In dealing with the operation of equipment we will call upon an activity currently known as “human engineering” or “engineering psychology.”
In our concern for equipment designers we will 
depend, at least in part, upon a body of theory 
and research known as ‘‘group dynamics.’’ For 
convenience I will identify problems involving 
operators as man-machine problems and prob- 
lems involving designers as man-man problems. 
The discussion of man-machine problems will be 
concerned with recommendations regarding the 
reduction of errors in the operation of equipment. 
The discussion of man-man problems will be 
concerned with improving the acceptance of those 
recommendations by design engineers. 


MAN-MACHINE PROBLEMS 


Serious interest in man-machine problems is usually identified with World War II, during which the design of complex equipment began to place unusual demands on the human operators of that equipment. Plagued by their human limitations, operators were often incapable of consistently performing their tasks in a safe manner. Numerous examples demonstrate that aircraft designers cannot take anything for granted when a human being is involved. During the last war, for example, one type of airplane was made with a large door that permitted access to the push-pull rods controlling the elevators. One pilot, not realizing what the rods were for, strapped a suitcase to them before taking off on a flight. After take-off he was lucky enough to be able to circle and land without crashing. Upon landing he complained bitterly about the difficulty he experienced in moving his elevators, until his own mistake was discovered.

*This paper was presented at the First Annual Bay Area Reliability Seminar, Menlo Park, Calif., February 9, 1959.

†Reliability Engrg. Dept., Missiles and Space Div., Lockheed Aircraft Corp., Sunnyvale, Calif.

Even in the Civil War, with the relatively simple 
equipment then in use, there were frequent errors 
that could be related to equipment design. It has 
been reported that after the Battle of Gettysburg, 
several thousand abandoned muskets were picked 
up. About one-half of these muskets had two 
charges in them, and more than one-quarter of the 
muskets had from three to ten charges rammed 
into their barrels. Apparently, under stress the 
relatively simple procedure of loading and firing 
a musket became overpowering, just when efficient 
operation was most critical. 

Stories such as these have led some persons to say that equipment must be designed so as to be “idiot proof.” I object to this term, however, because in using it an important point is overlooked. When equipment is not designed to prevent errors, or at least to reduce their likelihood, errors will still be made by the most highly trained and intelligent personnel operating with the best of intentions. How many people, for example, have never turned on the wrong burner of a stove or stalled their automobile engine because they forgot to shift gears after experiencing some stressful driving condition?

Errors in the operation of equipment interfere with the attainment of reliability in the same way as component failures. Their ultimate effect may well be a failure of the mission itself. In a man-machine system, therefore, human errors are an appropriate subject for the recommendations of reliability personnel.

In order to determine where the human engineer 
should direct his attention, we should first identify 
the types of error to which human operators are 
prone. 


TYPES OF ERRORS 


For convenience, the kinds of errors made in 
operating equipment may be classified as shown 
in Table I. 




TABLE I
HUMAN ERRORS RELATED TO EQUIPMENT DESIGN

1) Errors of omission
   a) Errors of memory
   b) Errors of attention

2) Errors of commission
   a) Errors of identification
   b) Errors of interpretation
   c) Errors of operation


Errors of omission are errors in which the 
operator plays a passive role. They result from 
things the operator does not do. Errors of com- 
mission, in contrast, are errors in which the 
operator is more active. They result from things 
the operator does do. The errors in each category 
will be described in the following discussion to- 
gether with specific examples and possible recom- 
mendations. 


Errors of Memory 


Errors of memory occur when the operator 
forgets to carry out a task or forgets the sequence 
in which an operation is supposed to be performed. 
The latter error is especially likely when the 
operator must work with equipment panels con- 
taining numerous displays and controls. Fig. 1 
illustrates the problem the operator often faces. 


Fig. 1—A panel layout that produces errors of memory.


The numbers on the figure indicate the sequence in which the controls (small circles), meters (large circles), and indicator displays (rectangles) are used. As the numbers indicate, there is no relation between the location of the devices on the panel and the order in which they will be used.

The solution to the problem is quite straight- 
forward. The arrangement of the controls and 
displays must be altered to reflect the sequence 
of operations. The standardized sequence should 
run from left to right and from top to bottom. 
With such an arrangement the operator will not be 
forced to rely so heavily on memory or opera- 
tional manuals. 
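The recommended rule, a usage sequence running left to right and top to bottom, lends itself to a mechanical check of a proposed layout. A minimal sketch, assuming each control or display is described only by its grid position (row, column) and the step at which it is used; the `Device` type, the function name, and the two example layouts are hypothetical and are not taken from Fig. 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    row: int   # 0 = top row of the panel
    col: int   # 0 = leftmost column
    step: int  # order in which the operator uses the device


def reading_order_violations(devices: list[Device]) -> list[tuple[str, str]]:
    """Return neighboring pairs (in reading order) whose step numbers go backward.

    Devices are sorted by (row, col), i.e., left to right, top to bottom.
    Each pair returned is a place where the layout contradicts the usage
    sequence, so the operator must fall back on memory or the manual.
    """
    in_reading_order = sorted(devices, key=lambda d: (d.row, d.col))
    return [
        (a.name, b.name)
        for a, b in zip(in_reading_order, in_reading_order[1:])
        if b.step < a.step
    ]


# Hypothetical layouts: the same three devices, scrambled and then rearranged.
scrambled = [Device("meter A", 0, 0, 3), Device("switch B", 0, 1, 1), Device("lamp C", 1, 0, 2)]
ordered = [Device("switch B", 0, 0, 1), Device("lamp C", 0, 1, 2), Device("meter A", 1, 0, 3)]

print(reading_order_violations(scrambled))  # [('meter A', 'switch B')]
print(reading_order_violations(ordered))    # []
```

An empty result means the panel can be worked simply by reading it, which is the condition the author asks the designer to create.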


Errors of Attention 


When acting as a monitor, the operator is 
frequently expected to notice changes in values 
displayed on a group of meters. Often the meters 
have null points and the interesting readings are 
the ones that show deviations from those points. 
Failure to notice a deviation would be classed 
as an error of attention. 

Fig. 2 illustrates the problem faced by the operator when the null points on his meters have dissimilar orientations. With this arrangement the operator must look at each individual meter to determine if any of them show an out-of-tolerance value. To decrease the possibility of unnoticed error signals, it is desirable to pattern the total display by orienting the null points in the same direction. The task then becomes that of scanning one or two rows at a time to identify breaks in the patterning. Fig. 3 pictures this principle of display arrangement.

Fig. 2—A difficult checking task.
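The benefit of a common null orientation can be illustrated by treating each meter pointer as an angle. A deviation is a pointer that departs from its meter's null by more than some tolerance; the only question is how much reference information the checker must carry. A minimal sketch, assuming hypothetical angles and a 5-degree tolerance that do not come from Fig. 2 or Fig. 3:

```python
def off_null(pointer_deg: list[float], null_deg: list[float], tol_deg: float = 5.0) -> list[int]:
    """Indices of meters whose pointer deviates from that meter's null by more than tol_deg."""
    flagged = []
    for i, (pointer, null) in enumerate(zip(pointer_deg, null_deg)):
        # Smallest signed angular difference, wrapped into [-180, 180).
        diff = (pointer - null + 180.0) % 360.0 - 180.0
        if abs(diff) > tol_deg:
            flagged.append(i)
    return flagged


# Dissimilar null orientations: a different reference angle for every meter.
print(off_null([0, 90, 210, 270], [0, 90, 180, 270]))  # [2]

# Patterned display: every null points the same way, so one reference serves the row.
print(off_null([90, 90, 120, 90], [90] * 4))           # [2]
```

Both layouts flag the same meter; the difference is that with a patterned display the list of nulls collapses to one repeated value, the analogue of the operator scanning a row for the single pointer that breaks the pattern instead of recalling a separate reference for every meter.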


Fig. 3—An improved display for a checking task.

Errors of Identification

An error of identification has been committed when an object is misidentified and then treated as if it were the correct object. There is considerable evidence to suggest that the frequency of errors of identification is much higher than that of any of the other errors. This high frequency may result from the fact that there are so many opportunities for errors of identification. Then, too, errors of this type can be made by the technicians who build the equipment as well as the operators who use it. The problem of miswired connections is perplexing many people in the electronics industry these days.

As an illustration of one identification problem, Fig. 4 represents a patch board that was actually used on a piece of checkout equipment. As can be seen, the numbering of the individual jacks has no regular sequence whatsoever, and only a partial attempt has been made to break the jacks up into individual groups. Fig. 5 shows the patch board that was recommended by human engineers. On the recommended patch board the sequence of the numbers has been preserved and the jacks have been more clearly grouped with the aid of black lines. In addition, the subgroups are labeled appropriately. These changes should certainly reduce errors of identification with this patch board.

Fig. 4—Identification problems on a patch board.

Fig. 5—An improved patch board layout. [Jack groups labeled with an overall title and sub-titles.]

Errors of Interpretation

Errors of interpretation occur when the operator misunderstands the meaning of some displayed information and, as a consequence, acts in an inappropriate manner. Fig. 6 pictures a type of dial display that almost assures the occurrence of errors of interpretation. Presumably, one of the two scales is to be used for controlling the amount of “lag,” and the other scale is to be used for controlling the amount of “lead.” The arrows provide indirect evidence concerning the scale to be used for each purpose. By determining which scale increases in the direction of each arrow, the operator can associate the functions with the scales. A small empirical test showed, however, that engineers are about as likely to make the incorrect association as the correct one. At the very least, the functional labels should be placed near the scales to which they apply in order to eliminate the ambiguity.

Fig. 6—A dial display associated with errors of interpretation and operation.




Other errors can also be expected with these 
scales, even after the association problem is 
solved. On the lower scale, the numbers increase 
in a counterclockwise direction. It is a safe bet 
that with such a scale the unnumbered index 
marks will occasionally be read ten units too high 
because people have a decided tendency to read 
scales from left to right. If both scales have to 
be on the same dial, a warning sign concerning 
the scale reversal should be installed near the 
lower scale. 


Errors of Operation 


For our purposes an error of operation is one 
in which the control movement is inappropriate 
to the desired effect. To Fig. 6, I have somewhat arbitrarily added a vernier control knob on the right side of the dial in order to illustrate one type of operational error.

Since most vernier knobs operate through a 
friction drive, a clockwise twist of the vernier 
knob is sure to produce a scale movement in a
direction opposite to that produced with the 
coarse control knob. Reversal errors in knob 
operation are the likely result. The replacement 
of friction drives with appropriate gear drives 
will be necessary for the achievement of consist- 
ent movement relationships. 
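The reversal follows from elementary kinematics: each time one wheel drives another by rim contact (a friction wheel, or a pair of meshing external spur gears), the sense of rotation flips. A minimal sketch, assuming the coarse knob turns the dial shaft directly and the vernier's friction wheel bears directly on the dial rim; routing the vernier through an extra idler gear is shown as one hypothetical form of gear drive that makes the count of contacts even and so restores a consistent direction. The function and drive descriptions are illustrative, not taken from Fig. 6.

```python
CLOCKWISE, COUNTERCLOCKWISE = 1, -1


def dial_direction(knob_direction: int, external_contacts: int) -> int:
    """Direction of dial rotation when the knob is turned in knob_direction.

    Each external rim contact between the knob and the dial reverses the
    sense of rotation, so the sign flips once per contact.
    """
    return knob_direction * (-1) ** external_contacts


drives = {
    "coarse knob on dial shaft": 0,           # no intermediate contact
    "vernier friction wheel on dial rim": 1,  # one rim contact
    "vernier through an idler gear": 2,       # knob gear -> idler -> dial gear
}

for name, contacts in drives.items():
    result = dial_direction(CLOCKWISE, contacts)
    print(f"{name}: {'clockwise' if result == CLOCKWISE else 'counterclockwise'}")
```

Only when every control path has an even number of reversals (zero counts) do a clockwise twist of the coarse knob and a clockwise twist of the vernier move the scale the same way.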


A Summary Statement 


At this point, I want to summarize those prin- 
ciples of human engineering that I have been able 
only to suggest in this brief survey. 


Equipment design should: 

1) Facilitate the recall of operational 
sequences. 

2) Exploit the effects of patterning in gaining 
the operator’s attention. 

3) Provide redundant cues to object identification.

4) Eliminate the need for display interpre- 
tation. 

5) Provide for consistent movement relations. 


MAN-MAN PROBLEMS 


Having examined a few of the human factors 
problems associated with the use of equipment, 
we now should consider the problems involved in 
implementing recommendations that result from 
our human factors review. At this point our dis- 
cussion takes on a greater degree of generality, 
since the problem of gaining acceptance for rec- 
ommendations is faced by all reliability engi- 
neers, regardless of the content of their sugges- 
tions. 

When the reliability engineer puts down his 
slide rule and approaches the design engineer 
with a recommendation, he encounters a new 




type of problem that he may find more baffling 
than the technical difficulty he feels he has just 
licked. He must now deal with a man-man prob- 
lem in a face-to-face discussion with a design 
engineer who will probably be polite, but will al- 
most certainly be resistant. The importance of 
this discussion between the reliability engineer 
and the design engineer must not be underesti- 
mated. The best recommendation ever devised 
by a reliability engineer will have little effect if 
it never reaches the hardware stage. 


GAINING ACCEPTANCE FOR 
RECOMMENDATIONS 


In discussing methods for improving the acceptance of recommendations, I shall not attempt to suggest specific techniques which the individual reliability engineer might use to influence the individual design engineer. Scientific psychology has really had very little to contribute in the way of improved techniques of this kind. In fact, only in the past few years have psychologists seriously attempted to identify the variables that are related to effective persuasion. Recently, for example, a study was conducted in which husbands and wives were asked to come to an agreement on problems on which they had previously expressed conflicting views [3]. The resulting discussions were then analyzed to determine how the conversation of the “winning” husband or wife differed from that of the “loser.” One of the conclusions of the study was that the person who talked the most was most likely to come out on top. Perhaps this is all that reliability engineers need to do to win their points, but they may have trouble if the design engineers have also read about the same study.

Since the variables affecting persuasiveness have not yet been adequately described, we shall all have to continue to depend largely upon our own opinions and experience in determining how to deal with other people. However, we need not operate completely on our own. We can draw upon the work of various social scientists to supplement our experience in providing appropriate working conditions that will encourage effective interactions between reliability and design engineers.

The conditions that we will want to establish depend upon the nature of the resistance that we wish to overcome. Examination of the derived needs of both the design engineer and the reliability engineer will provide us with a better understanding of this resistance.


Why Design Engineers Resist Reliability 
Recommendations 

Without attempting a comprehensive discussion of motivation, we can still identify some important needs which affect the behavior of design engineers (and the behavior of most other people as well).

There is an abundance of experimental evidence 
which indicates that a person’s needs influence his 
perception of a situation and his reaction to that 
situation. Right now we are interested in those 
needs that help determine the engineer’s behavior 
when faced with a design recommendation from an 
outside source. Two needs seem to be of partic- 
ular importance. 

Recent surveys show that job security ranks high in lists of the job characteristics most desired by job holders. Like everyone else, the engineer has a need for job security, and his security is clearly threatened by anyone who recommends a change in a design that the engineer identifies as his own. The reception to such recommendations is likely to be unfavorable.

Furthermore, the engineer is likely to be concerned with his status as an expert, and, again like everyone else, he dislikes having his work challenged. This need holds with regard to human
factors recommendations as well as for recom- 
mendations concerning more traditional engineer- 
ing activity. As one engineer expressed it, 
‘‘Every engineer also likes to think of himself as 
a human factors engineer.’’ Having adopted the 
role of an expert, no one likes to abandon the role 
without showing some resistance. 

In addition to his own needs, the design engi- 
neer is influenced by a wide variety of conditions. 
He may not have a clear idea of the purpose and 
scope of reliability engineering activities or the 
services that reliability engineers can perform. 
He may have received considerable discouragement 
in the past whenever he himself suggested design 
changes. Consequently, a change recommended 
after a design has been frozen is particularly 
likely to receive brusque treatment. Finally, the 
engineer is influenced by his supervisor, who may 
also have negative reactions to reliability recom- 
mendations. 

Clearly, the design engineer is subject to a 
number of influences, both internal and external. 
But what about the reliability engineer? Is he free 
to operate without regard to these pressures? 


Why Reliability Engineers Sometimes Fail 


Unfortunately, the reliability engineer is no more able to operate unaffected by personal needs than is the design engineer.

The reliability engineer sees the resistance of the design engineer as a threat to his own job security. In addition, resistance to a recommendation is a serious challenge to the status of the reliability engineer as an expert. As a result, he may develop a defensive attitude in working with design engineers before his opinion is even challenged.

Typically, the reliability engineer also sees 
himself acting as a critic, a role for which he 
doesn’t particularly care. Furthermore, in this 
role he expects to alienate most of the people with 
whom he works. One reliability engineer has 
indicated his concern by stating: ‘‘After a couple 
years at this job you have so many enemies that 
you either have to go into another department or 
to work for a different company.’’ 

The reliability engineer may also find himself 
in an unfavorable position because, although he is 
a member of a reliability department, he spends 
most of his time working with people in other de- 
partments. In this situation he may find it difficult 
to identify himself with either department. At the 
same time the design engineer sees him clearly as 
an outsider without authority in the design depart- 
ment. 

Finally, the reliability engineer has a real need 
for achievement, a need that is often frustrated by 
design schedules that make his recommendations 
obsolete before they are even well formulated. 


A Resolution of the Acceptance Problem 


Having identified some of the reasons why 
reliability recommendations may meet resistance, 
we should try to outline a program that will im- 
prove the chances of acceptance. A successful 
program must reduce resistance and put the reli- 
ability engineer in a more favorable position. 

For our purposes we will view the lack of 
acceptance as evidence of a morale problem 
operating in a two-man group composed of a design 
engineer and a reliability engineer. 

According to one analysis, morale is determined by four factors [1]. To achieve a high level of morale, the members of the group must:


1) Achieve a feeling of group cooperation. 

2) Establish a common goal. 

3) Have specific tasks that are necessary to 
the achievement of the goal. 

4) See that progress is being made in achieving 
the goal. 


The nice thing about this analysis of morale is 
that by satisfying the four determinants we can also 
go a long way toward satisfying the human needs 
previously described. 

Satisfaction of the first determinant of morale requires the establishment of conditions that will foster a genuine feeling of cooperation between the reliability and design engineers. Somehow we must get across the idea that the reliability engineer is part of the design group in which he works, not an outsider.

In order to satisfy the second determinant, 
both parties must come to realize the communal- 
ity of their respective goals. Evidence from stud- 
ies of social behavior suggests that the similarity 
in people’s attitudes and views is directly related 
to the frequency of interaction that takes place be- 
tween them [2]. We should therefore try to pro- 
vide conditions that will help to satisfy determi- 
nants one and two by encouraging frequent inter- 
actions between reliability and design engineers. 
One way in which this may be accomplished is to 
locate permanently the reliability engineer in the 
same work space with the engineers with whom he 
will work, where opportunities for interaction will 
be at a maximum. 

To achieve the greatest benefit from this ar- 
rangement, it must be clear that the reliability 
engineer has a status equal to that of the people 
with whom he most frequently deals. Channels 
of authority must also be established. Practically, 
this means, among other things, that the reliabil- 
ity engineer must have as desirable a work space 
as his design counterpart and that the design su- 
pervisor must clearly establish the authority of 
the reliability engineer in the departmental organ- 
ization. 

It should be noted that these procedures will 
also aid the reliability engineer in identifying with 
the group with which he works. The status of the 
reliability engineer as an expert will be improved 
as well. Once the reliability engineer is accepted 
as part of the design team, the unique contribu- 
tions of both the reliability and design engineers 
will be seen to be directed toward the same goal 
—the design of the best possible piece of equip- 
ment. 

If the limited size of the reliability staff makes 
it necessary for reliability engineers to work with 
several different groups of design engineers, it 
may not be desirable to locate individual reliabil- 
ity engineers within any one particular group. 
Under these circumstances, cooperation may be 
more difficult to achieve. There are, however, 
additional techniques, which should be employed 
in any case. These techniques, which are de- 
scribed in the following discussion, will help to 
compensate for the loss. 

Thus, final achievement of cooperation and goal 
alignment will depend on the development of con- 
ditions that will enable the reliability engineer to 
accept responsibility as a participant in initial de-
sign stages rather than as a critic whose inputs
are often too late. To achieve the desired state,
the reliability engineer must shift the major share 
of his attention to projects that are in their initial 
stages even if he must, as a result, neglect some 
of the fire drills that are in continual rehearsal. 
The benefits from such a shift in emphasis have a 
large potential. A systematic program for reli- 
ability efforts is almost certain to produce better 
results than a haphazard program. Furthermore, 
the acceptance of reliability recommendations will 
be enhanced by the resulting satisfaction of the 
human needs previously discussed. As the reli-
ability engineer becomes a responsible contri-
butor during initial design stages, his contribu-
tions become less of a threat to the design engi-
neer. The reliability engineer also benefits. He no
longer has to resent his role as a critic who, be- 
cause of the nature of his role, must expect to 
alienate the people with whom he works. Finally, 
as he sees the influence of his recommendations 
appearing in finished equipment, the reliability 
engineer’s need for achievement should be grati- 
fied. 

During its early stages of development the co- 
operative spirit can be strengthened by a mutual 
program of cross education. From the reliability 
end, the educational program should consist of 
printed design bulletins containing general rec- 
ommendations for the improvement of system 
reliability. These bulletins should be supplemen- 
ted with frequent seminars during which the scope 
of the reliability program is indicated and meth- 
ods of cooperation are discussed. 

The methodology of cooperation must not be 
slighted, for the satisfaction of morale determi- 
nant number three requires the establishment of 
specific tasks related to the accomplishment of 
the mutual goal. It will not be sufficient to estab- 
lish a cooperative atmosphere without also estab- 
lishing detailed procedures for interaction and ex- 
change of ideas. The agreed upon techniques will 
undoubtedly differ from one situation to another, 
but they should be products of a group decision 
on the part of the reliability and design engineers. 

The final determinant of morale requires that 
cooperating engineers must see that they are mak- 
ing progress toward their joint goal of increased 
reliability. To be effective, this feedback should 
be as immediate and specific as possible. What 
is needed are numbers that reflect reductions in 
the frequency with which failures are experienced 
as a result of equipment modifications initiated to 
improve reliability. As the collection of failure 
data itself becomes reliable, special at- 
tempts should be made to use the data to 
supply the desired knowledge of results. Until
this step is taken, the requirements of our four
determinants will not be fulfilled.


CONCLUSIONS 


Although the results achieved will depend on the 
particular people involved, it is clear that atten-
tion to human factors can improve the reliability
attained in the design and operation of complex 
equipment. 

Human factors are important in the operation 
of equipment for the obvious reason that humans 
are doing the operating. When equipment design 
is inadequate, human operators are prone to 
errors that may seriously reduce the inherent 
reliability of the components of the equipment. 
Design modifications can curtail the opportunities 
for these man-machine errors. 

Human factors are also important to equip- 
ment design because the acceptance of reliability 
recommendations is affected by the personal 
needs that influence the behavior of both design 
and reliability engineers. The lack of effective 
cooperation between reliability and design engi- 
neers may be viewed as a symptom of low morale. 




Resolution of the morale problem requires the 
establishment of working conditions that will help 
the reliability and design engineers to: 


1) Achieve a feeling of group cooperation. 

2) Establish a common goal. 

3) Decide on specific tasks that will lead to 
the goal. 

4) See that progress is being made toward the 
goal. 


The satisfaction of these four determinants of 
morale can be at least partially achieved through 
appropriate organizational arrangements and a 
shift in the attention of the reliability engineer to 
the earliest design stages. 


REFERENCES 


[1] M. L. Blum, ‘‘Industrial Psychology and Its Social Foundations,’’ Harper and Brothers, New York, N. Y.; 1949.

[2] G. C. Homans, ‘‘The Human Group,’’ Harcourt, Brace and Co., Inc., New York, N. Y.; 1950.

[3] F. L. Strodtbeck, ‘‘Husband-wife interaction over revealed differences,’’ Amer. Sociol. Rev., vol. 16; 1951.




ESTIMATION FROM LIFE TEST DATA* 


BENJAMIN EPSTEIN†


We first discuss problems of estimation where the life $X$ is described by the pdf

$$f(x;\theta) = \frac{1}{\theta}\, e^{-x/\theta}, \qquad x > 0,\ \theta > 0. \tag{1}$$

Case I: Life testing is discontinued after a fixed number $r$ of items have failed. Items on test may or may not be replaced. The number of items initially on test is $n$.

The ‘‘best’’ estimate of the mean life $\theta$ is given by

$$\hat\theta_{r,n} = \frac{T_{r,n}}{r}, \tag{2}$$


where $T_{r,n}$ is the accumulated life on test until the $r$th failure occurs. The observed failure times are $x_{1,n} \le x_{2,n} \le \cdots \le x_{r,n}$. For simplicity of notation, we will suppress the subscript $n$.

If testing is terminated after $r\,(< n)$ failures have occurred, then in the nonreplacement case

$$T_r = \sum_{i=1}^{r} x_i + (n - r)\, x_r. \tag{3}$$

In the replacement case (where $r$ is now unrestricted),

$$T_r = n\, x_r. \tag{4}$$
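The bookkeeping in (2) through (4) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's procedure: the function names are hypothetical, and only the fifth failure time (407 hours, from Example 1) comes from the paper; the first four times are invented for the usage lines. In the replacement case they do not matter, since (4) uses only $x_r$.

```python
def total_life(failure_times, n, replaced):
    """Accumulated life on test T_r up to the r-th failure.

    failure_times: observed failure times (any order); r = len(failure_times).
    n: number of items initially on test.
    replaced: True if each failed item is replaced at once by a new one.
    """
    times = sorted(failure_times)
    r = len(times)
    if r == 0:
        raise ValueError("at least one failure is required")
    x_r = times[-1]
    if replaced:
        # Eq. (4): n test positions are always occupied, so T_r = n * x_r.
        return n * x_r
    if r > n:
        raise ValueError("without replacement, r cannot exceed n")
    # Eq. (3): failed items contribute their own lives; the n - r
    # survivors each contribute x_r.
    return sum(times) + (n - r) * x_r


def mean_life_estimate(failure_times, n, replaced):
    """Eq. (2): theta_hat = T_r / r."""
    return total_life(failure_times, n, replaced) / len(failure_times)


# Hypothetical failure times; only the fifth (407 h) is from Example 1.
times = [60.0, 130.0, 210.0, 330.0, 407.0]
print(total_life(times, n=20, replaced=True))          # 8140.0
print(mean_life_estimate(times, n=20, replaced=True))  # 1628.0
print(total_life(times, n=20, replaced=False))         # 7242.0
```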


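The probability density function of $\hat\theta_r$, in either the replacement or nonreplacement case, is given by

$$f(y) = \frac{1}{(r-1)!}\left(\frac{r}{\theta}\right)^{r} y^{r-1} e^{-ry/\theta}, \qquad y > 0, \tag{5}$$

$$f(y) = 0, \qquad \text{elsewhere}.$$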

From (5) it follows that

$$\frac{2r\hat\theta_r}{\theta} = \frac{2T_r}{\theta} \quad \text{is distributed as } \chi^2(2r) \tag{6}$$

(i.e., as chi square with $2r$ degrees of freedom).

From (6) it follows that a two-sided $100(1-\alpha)$ per cent confidence interval for $\theta$ is given by

$$\frac{2T_r}{\chi^2_{\alpha/2}(2r)} < \theta < \frac{2T_r}{\chi^2_{1-\alpha/2}(2r)}, \tag{7}$$

where $\chi^2_{\alpha/2}(2r)$ is the upper $\alpha/2$ percentage point of $\chi^2(2r)$ and $\chi^2_{1-\alpha/2}(2r)$ is the lower $\alpha/2$ percentage point of $\chi^2(2r)$. Similarly, a one-sided $100(1-\alpha)$ per cent confidence interval for $\theta$ is given by

$$\theta > \frac{2T_r}{\chi^2_{\alpha}(2r)}, \tag{8}$$

where $\chi^2_{\alpha}(2r)$ is the upper $\alpha$ percentage point of $\chi^2(2r)$.

Put into words, in the long run $100(1-\alpha)$ per cent of assertions of the kind made in (7) and (8) will be correct.

*This paper was presented at the First Annual Bay Area Reliability Seminar, Menlo Park, Calif., February 19, 1959. This work was done with the partial support of the Office of Naval Research. This material is covered in detail in Technical Report No. 4, ONR Contract Nonr-2163(...

†Wayne State University, Detroit, Mich.
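The intervals (7) and (8) need only upper percentage points of $\chi^2(2r)$. A minimal table-free sketch, assuming Python's standard library only, uses the standard identity that for even degrees of freedom $2k$, $\Pr[\chi^2(2k) > x]$ equals the probability that a Poisson variable with mean $x/2$ is at most $k-1$, and inverts that tail by bisection. The helper names are hypothetical; the usage line reproduces Example 1.

```python
import math


def chi2_upper(alpha, df):
    """Upper-alpha percentage point of chi-square with even df, by bisection."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch handles only positive even df")
    k = df // 2

    def tail(x):
        # P(chi2(2k) > x) = P(Poisson(x / 2) <= k - 1)
        term = math.exp(-x / 2)
        total = term
        for j in range(1, k):
            term *= (x / 2) / j
            total += term
        return total

    lo, hi = 0.0, 1.0
    while tail(hi) > alpha:  # widen until the root is bracketed
        hi *= 2
    for _ in range(100):     # tail(x) decreases as x grows
        mid = (lo + hi) / 2
        if tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def theta_intervals(T_r, r, alpha=0.05):
    """Two-sided interval (7) and one-sided lower bound (8) for theta."""
    two_sided = (2 * T_r / chi2_upper(alpha / 2, 2 * r),
                 2 * T_r / chi2_upper(1 - alpha / 2, 2 * r))
    one_sided_lower = 2 * T_r / chi2_upper(alpha, 2 * r)
    return two_sided, one_sided_lower


# Example 1: T_5 = 8140 hours, r = 5, alpha = 0.05.
(lower, upper), bound = theta_intervals(8140, 5)
print(round(lower), round(upper), round(bound))  # 795 5014 889
```

Bisection is chosen only because it is short and needs no external library; a statistics package's inverse chi-square function would serve equally well.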


Example 1 


20 electron tubes are placed on test. A tube 
which fails is replaced at once by a new tube. 
The fifth failure is observed to occur 407 hours 
after the start of the life test. 

Estimate the mean life $\theta$ and give one- and two-sided 95 per cent confidence intervals for $\theta$.

Solution: We are dealing with a replacement situation with $n = 20$, $r = 5$, $x_5 = 407$. The total life observed is, according to (4), given by $T_5 = 20x_5 = 20(407) = 8140$. Thus it follows from (2) that $\hat\theta = T_5/5 = 8140/5 = 1628$ hours. To find a two-sided 95 per cent confidence interval for $\theta$, we use (7) with $\chi^2_{0.025}(10) = 20.483$ and $\chi^2_{0.975}(10) = 3.247$. Substituting in (7), we get the two-sided 95 per cent confidence interval $795 < \theta < 5014$. To find a one-sided 95 per cent confidence interval, we use (8) with $\chi^2_{0.05}(10) = 18.307$. Substituting in (8), we get the one-sided 95 per cent confidence interval $\theta > 889$ hours.

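Frequently we are interested not only in estimating $\theta$ but also in estimating a quantity $x_p$, where $x_p$ is that life such that

$$\Pr(X > x_p) = p. \tag{9}$$

For the exponential distribution,¹

$$x_p = \theta \log\frac{1}{p}. \tag{10}$$

Maximum likelihood estimates and $100(1-\alpha)$ per cent confidence intervals for $x_p$ are given by

$$\hat{x}_p = \hat\theta_r \log\frac{1}{p} = \frac{T_r}{r}\log\frac{1}{p}, \tag{11}$$

$$\frac{2T_r\log(1/p)}{\chi^2_{\alpha/2}(2r)} < x_p < \frac{2T_r\log(1/p)}{\chi^2_{1-\alpha/2}(2r)} \tag{12}$$

in the two-sided case and

$$x_p > \frac{2T_r\log(1/p)}{\chi^2_{\alpha}(2r)} \tag{13}$$

in the one-sided case.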

Eq. (13) can be interpreted as follows.

We can be $100(1-\alpha)$ per cent confident of the truth of the assertion that the probability of surviving $\tau = 2T_r\log(1/p)/\chi^2_{\alpha}(2r)$ time units is $> p$.

This is a tolerance interval statement in that we can be $100(1-\alpha)$ per cent confident of the correctness of the assertion that the fraction of items surviving $\tau$ or more time units is $> p$. Putting the last statement into reliability language, we can be $100(1-\alpha)$ per cent confident that the reliability over $[0, \tau]$ is $> p$.
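Equations (11) through (13) reuse the same chi-square points, scaled by $\log(1/p)$. The sketch below is an illustration, not the paper's method: the function names are hypothetical, and it includes a bisection helper for even-df chi-square points (built on the Poisson form of the chi-square tail) in place of tables. It is applied to the data of Examples 2 and 3. Because it carries $\log(1/0.9)$ to full precision instead of 0.1054, its results can differ from the printed ones in the last digit.

```python
import math


def chi2_upper(alpha, df):
    """Upper-alpha percentage point of chi-square with even df, by bisection."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch handles only positive even df")
    k = df // 2

    def tail(x):
        # P(chi2(2k) > x) = P(Poisson(x / 2) <= k - 1)
        term = math.exp(-x / 2)
        total = term
        for j in range(1, k):
            term *= (x / 2) / j
            total += term
        return total

    lo, hi = 0.0, 1.0
    while tail(hi) > alpha:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def percentile_life(T_r, r, p, alpha=0.05):
    """Eqs. (11)-(13) for x_p, the life exceeded with probability p."""
    L = math.log(1 / p)  # natural logarithm, as in the paper
    point = T_r / r * L
    two_sided = (2 * T_r * L / chi2_upper(alpha / 2, 2 * r),
                 2 * T_r * L / chi2_upper(1 - alpha / 2, 2 * r))
    tau = 2 * T_r * L / chi2_upper(alpha, 2 * r)  # one-sided bound = tolerance time
    return point, two_sided, tau


# Examples 2 and 3: T_5 = 8140 hours, r = 5, p = 0.9.
point, (lo, hi), tau = percentile_life(8140, 5, 0.9)
print(f"x_0.9 estimate {point:.1f} h, 95% interval ({lo:.1f}, {hi:.0f}), tau {tau:.1f} h")
```

The returned `tau` is the same number that serves as the tolerance time in Example 3.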


Example 2 


Given the data in Example 1, estimate $x_{0.9}$, where $x_{0.9}$ is such that

$$\Pr(X > x_{0.9}) = 0.9 \tag{14}$$

(i.e., the probability of surviving for $x_{0.9}$ hours is 0.9). Give one- and two-sided 95 per cent confidence intervals for $x_{0.9}$.

Solution: $\log(1/0.9) = \log(10/9) = 0.1054$. Hence, substituting in (11), we get

$$\hat{x}_{0.9} = (1628)(0.1054) = 171.6 \text{ hours}. \tag{15}$$

Substituting in (12), we get the two-sided 95 per cent confidence interval $83.8 < x_{0.9} < 528$, and substituting in (13), we get the one-sided 95 per cent confidence interval $x_{0.9} > 93.7$ hours.

¹All logarithms used in this paper are natural logarithms.


Example 3 


Given the data in Example 1, find a number $\tau$ such that we can assert with 95 per cent confidence that at least 90 per cent of the items in the population survive $\tau$ hours. (Note that this is a tolerance statement. It can also be given a reliability interpretation.)

Solution: We have noted above that one-sided $100(1-\alpha)$ per cent confidence statements regarding $x_p$ are also tolerance statements in which we can have $100(1-\alpha)$ per cent confidence. Hence it follows from the solution to Example 2 that we can assert with 95 per cent confidence that at least 90 per cent of the items in the population survive $\tau = 93.7$ hours.

We are frequently interested in making point and interval estimates about the probability that the item survives a preassigned length of time $t^*$. Denoting this by $p_{t^*}$, we have

$$p_{t^*} = \Pr(X > t^*) = e^{-t^*/\theta}. \tag{16}$$

It is obvious how one can make point and interval estimates of $p_{t^*}$ from the corresponding formulae for $\theta$ [see (2), (7), and (8)]. In particular, a one-sided $100(1-\alpha)$ per cent confidence interval for $p_{t^*}$ is given by

$$p_{t^*} > \exp\left[-\frac{\chi^2_{\alpha}(2r)\, t^*}{2T_r}\right]. \tag{17}$$

The question may be asked: how large should the observed $T_r$ be in order that we be $100(1-\alpha)$ per cent confident that

$$p_{t^*} = e^{-t^*/\theta} > \gamma? \tag{18}$$

From (17) this implies that

$$\exp\left[-\frac{\chi^2_{\alpha}(2r)\, t^*}{2T_r}\right] \ge \gamma, \tag{19}$$

or

$$T_r \ge \frac{\chi^2_{\alpha}(2r)\, t^*}{2\log(1/\gamma)}. \tag{20}$$


The interpretation of (20) is as follows. If the total life observed in getting $r$ failures exceeds $\chi^2_{\alpha}(2r)\, t^*/[2\log(1/\gamma)]$, then we can be $100(1-\alpha)$ per cent confident of the assertion that the probability of surviving time $t^*$ is $> \gamma$. In reliability considerations, we can replace these words by ‘‘reliability over a time interval of length $t^*$ is $> \gamma$.’’
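Equations (16), (17), and (20) are transformations of the $\theta$ results, so a sketch needs nothing beyond chi-square points. The function names below are hypothetical, and the chi-square helper (a bisection on the Poisson form of the even-df chi-square tail) is an assumption of this sketch standing in for tables. The usage lines follow Examples 4 and 5.

```python
import math


def chi2_upper(alpha, df):
    """Upper-alpha percentage point of chi-square with even df, by bisection."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch handles only positive even df")
    k = df // 2

    def tail(x):
        # P(chi2(2k) > x) = P(Poisson(x / 2) <= k - 1)
        term = math.exp(-x / 2)
        total = term
        for j in range(1, k):
            term *= (x / 2) / j
            total += term
        return total

    lo, hi = 0.0, 1.0
    while tail(hi) > alpha:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def survival_estimates(T_r, r, t_star, alpha=0.05):
    """Point estimate of p_t* (16), two-sided interval from (7), bound (17)."""
    point = math.exp(-t_star * r / T_r)  # exp(-t*/theta_hat)
    lo2 = math.exp(-t_star * chi2_upper(alpha / 2, 2 * r) / (2 * T_r))
    hi2 = math.exp(-t_star * chi2_upper(1 - alpha / 2, 2 * r) / (2 * T_r))
    lower = math.exp(-t_star * chi2_upper(alpha, 2 * r) / (2 * T_r))
    return point, (lo2, hi2), lower


def required_total_life(r, t_star, gamma, alpha=0.05):
    """Eq. (20): total life needed to be confident that p_t* > gamma."""
    return chi2_upper(alpha, 2 * r) * t_star / (2 * math.log(1 / gamma))


# Example 4: T_5 = 8140 hours, r = 5, t* = 100 hours.
point, (lo, hi), lower = survival_estimates(8140, 5, 100)
print(f"{point:.4f} ({lo:.4f}, {hi:.4f}) > {lower:.4f}")

# Example 5: is an observed total life of 9205 hours with r = 5 enough?
needed = required_total_life(5, 100, 0.9)
print(round(needed), 9205 > needed)
```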


Example 4 


Given the data in Example 1, make one- and 
two-sided 95 per cent confidence statements for 
the probability of surviving t* = 100 hours. 


Solution: The maximum likelihood estimate of $p_{t^*}$, the probability of surviving $t^* = 100$ hours, is given by

$$\hat{p}_{t^*} = e^{-100/1628} = e^{-0.0614} = 0.9404.$$

Similarly, a two-sided 95 per cent confidence interval for $p_{t^*}$ is given by

$$\left(e^{-100/795},\ e^{-100/5014}\right) = (0.8817,\ 0.9803).$$

A one-sided 95 per cent confidence interval for $p_{t^*}$ is given by substituting in (17). This gives us

$$p_{t^*} > 0.8936.$$

We can be 95 per cent confident of the assertion that the probability of surviving 100 hours (reliability over the time interval (0, 100)) is $> 0.8936$.


Example 5 


The total life observed in obtaining 5 failures 
is 9205 hours. On the basis of this information, 
can we be 95 per cent confident that the probability of surviving (reliability) for a time $t^* = 100$ hours is $> 0.90$?

Solution: From (20) it is known that in order to be 95 per cent confident that the probability of surviving for a time $t^*$ is $> 0.9$, it is necessary that the total observed life

$$T_5 > \frac{\chi^2_{0.05}(10)\,(100)}{2\log(1/0.9)} = 8689.$$

Since the total life observed in obtaining 5 failures is 9205, we can answer in the affirmative, i.e., we can be 95 per cent confident that the probability of surviving for a time $t^* = 100$ hours is $> 0.90$.


Case II: Underlying distribution is exponential. The life test is discontinued after a fixed amount of total life $T$ has elapsed. Items under test may or may not be replaced.

In what follows, let $r$ = number of items which fail in $[0, T]$; then some formulas of interest are:

Two-sided $100(1-\alpha)$ per cent confidence interval for $\theta$:

$$\frac{2T}{\chi^2_{\alpha/2}(2r+2)} < \theta < \frac{2T}{\chi^2_{1-\alpha/2}(2r)}. \tag{21}$$

In the important special case where $n$ items are tested, with replacement, for a length of time $t^*$, $T = nt^*$.

One-sided $100(1-\alpha)$ per cent confidence interval for $\theta$:

$$\theta > \frac{2T}{\chi^2_{\alpha}(2r+2)}. \tag{22}$$

One-sided $100(1-\alpha)$ per cent confidence interval for the quantity $x_p = \theta\log(1/p)$:

$$x_p > \frac{2T\log(1/p)}{\chi^2_{\alpha}(2r+2)}. \tag{23}$$

If we define $\tau$ as

$$\tau = \frac{2T\log(1/p)}{\chi^2_{\alpha}(2r+2)}, \tag{24}$$

then we can assert with $100(1-\alpha)$ per cent confidence that at least $100p$ per cent of the items survive for a length of time $\tau$. Putting the last statement into reliability language, we can be $100(1-\alpha)$ per cent confident of the truth of the assertion that the reliability over $[0, \tau]$ is $> p$.
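For Case II the only changes from Case I are $2r+2$ degrees of freedom in the lower limits and the fixed total life $T$ in place of $T_r$. A hedged sketch follows, with hypothetical function names and a bisection helper for even-df chi-square points standing in for tables. For $r = 0$ it assumes the upper limit in (21) is unbounded, since $\chi^2(0)$ is concentrated at zero. The usage lines use the data of Examples 6 and 7 ($T = 30 \times 100 = 3000$).

```python
import math


def chi2_upper(alpha, df):
    """Upper-alpha percentage point of chi-square with even df, by bisection."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch handles only positive even df")
    k = df // 2

    def tail(x):
        # P(chi2(2k) > x) = P(Poisson(x / 2) <= k - 1)
        term = math.exp(-x / 2)
        total = term
        for j in range(1, k):
            term *= (x / 2) / j
            total += term
        return total

    lo, hi = 0.0, 1.0
    while tail(hi) > alpha:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def case2_intervals(T, r, alpha=0.05):
    """Two-sided interval (21) and one-sided bound (22) for theta."""
    lower2 = 2 * T / chi2_upper(alpha / 2, 2 * r + 2)
    # Assumption: with r = 0 the upper limit of (21) is unbounded.
    upper2 = 2 * T / chi2_upper(1 - alpha / 2, 2 * r) if r > 0 else math.inf
    lower1 = 2 * T / chi2_upper(alpha, 2 * r + 2)
    return (lower2, upper2), lower1


def case2_tau(T, r, p, alpha=0.05):
    """Eq. (24): time tau over which reliability exceeds p, with confidence."""
    return 2 * T * math.log(1 / p) / chi2_upper(alpha, 2 * r + 2)


# Examples 6 and 7: n = 30 items, t* = 100 hours, so T = 3000; r = 5.
(lo2, hi2), lo1 = case2_intervals(3000, 5)
print(round(lo2), round(hi2), round(lo1))  # 257 1848 285
print(round(case2_tau(3000, 5, 0.9), 1))   # 30.1
```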


Example 6 


30 items are placed on test. Items which fail 
are replaced. The life test is stopped after 100 
hours have elapsed. Five failures were observed 
in the course of the experiment. Assuming that 
the underlying distribution of life is exponential, 
find one- and two-sided 95 per cent confidence intervals for $\theta$.

Solution: In this problem the fixed amount of total life observed is $T = nt^* = 30(100) = 3000$. Substituting in (21) and using $\chi^2_{0.025}(12) = 23.337$ and $\chi^2_{0.975}(10) = 3.247$, one gets the two-sided 95 per cent confidence interval

$$257 < \theta < 1848.$$

Substituting in (22) and using $\chi^2_{0.05}(12) = 21.026$, we get the one-sided 95 per cent confidence interval

$$\theta > 285.$$

Example 7


Given the data in Example 6, estimate $\tau$ so that we will be 95 per cent confident that the probability of surviving $\tau$ hours is at least 0.9. Substituting in (24), and using $T = 3000$, $r = 5$, $\alpha = 0.05$, $p = 0.9$, we get

$$\tau = \frac{6000\,(0.1054)}{21.026} = 30.1 \text{ hours}.$$

On the basis of the data we can be 95 per cent confident that the probability of surviving $\tau = 30.1$ hours is $> 0.9$.


Case III: $n$ items are placed on life test for a time $t^*$. At the end of this time one counts the number of items that have failed in $[0, t^*]$. Items that fail are not replaced.

In what follows let $r$ = number of observed failures. Then we can make the following nonparametric statement.

We can assert with $100(1-\alpha)$ per cent confidence that at least $100b$ per cent of the population survives for a length of time $t^*$, with $b$ given by

$$b = \frac{1}{1 + \left(\dfrac{r+1}{n-r}\right) F_{\alpha}(2r+2,\ 2n-2r)}. \tag{25}$$

Put in reliability language, we are $100(1-\alpha)$ per cent confident of the assertion that the reliability over $[0, t^*]$ is $> b$.

In the particular case where the underlying distribution is exponential, a one-sided $100(1-\alpha)$ per cent confidence interval for $\theta$ is given by

$$\theta > \frac{t^*}{\log\left\{1 + \left(\dfrac{r+1}{n-r}\right) F_{\alpha}(2r+2,\ 2n-2r)\right\}}. \tag{26}$$

In (25) and (26), $F_{\alpha}(2r+2, 2n-2r)$ is the upper $\alpha$ percentage point of the $F(2r+2, 2n-2r)$ distribution.
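Formula (25) needs a percentage point of the F distribution. An equivalent table-free route, assumed here and not taken from the paper, uses the standard relation between binomial tails and the F distribution: $b = 1 - q_U$, where $q_U$ is the failure probability at which observing $r$ or fewer failures among $n$ items has probability exactly $\alpha$ (the exact one-sided Clopper-Pearson bound). Since $1 + \frac{r+1}{n-r}F_\alpha = 1/b$, (26) becomes $\theta > t^*/\log(1/b)$. The function names are hypothetical; the usage lines follow Examples 8 to 10, which the sketch reproduces to within the rounding of the tabled F values.

```python
import math


def binom_cdf(r, n, q):
    """P(at most r failures among n items), each failing with probability q."""
    log_q, log_1mq = math.log(q), math.log1p(-q)
    total = 0.0
    for j in range(r + 1):
        # Work in logs so that n = 10,000 does not overflow the binomial terms.
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1)
                    - math.lgamma(n - j + 1) + j * log_q + (n - j) * log_1mq)
        total += math.exp(log_term)
    return total


def reliability_lower_bound(n, r, alpha=0.05):
    """b of eq. (25), found by inverting the binomial tail with bisection."""
    if not 0 <= r < n:
        raise ValueError("need 0 <= r < n")
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(r, n, mid) > alpha:  # the cdf falls as q rises
            lo = mid
        else:
            hi = mid
    return 1 - (lo + hi) / 2


def mean_life_lower_bound(n, r, t_star, alpha=0.05):
    """Eq. (26) rewritten as theta > t* / log(1/b)."""
    return t_star / math.log(1 / reliability_lower_bound(n, r, alpha))


print(reliability_lower_bound(20, 2), mean_life_lower_bound(20, 2, 100))  # Example 8
print(reliability_lower_bound(10_000, 10))  # Example 9
print(reliability_lower_bound(10_000, 0))   # Example 10
```

With exact binomial arithmetic, Example 8 gives about 0.717 and 301 rather than the printed 0.718 and 302; the small gap comes from the two-figure table value $F_{0.05}(6, 36) = 2.36$.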


Example 8 


20 items are placed on life test for 100 hours. 
Two items fail before this time. Items which fail 
are not replaced. 

a) Make a nonparametric one-sided 95 per 
cent confidence statement about the probability of 
surviving 100 hours. 

b) If the underlying distribution is exponential, find a one-sided 95 per cent confidence interval for the mean life $\theta$.

Solution: a) In this problem $n = 20$, $r = 2$, $\alpha = 0.05$, $t^* = 100$. Since $F_{0.05}(6, 36) = 2.36$, it follows from (25) that $b = 0.718$. Hence we can make the following nonparametric statement.

We are 95 per cent confident of the assertion that the probability of surviving 100 hours (reliability over 100 hours) is $> 0.718$.

b) Substituting in (26), we get the one-sided 95 per cent confidence interval $\theta > 302$.


Example 9 


Ten thousand one-hour missions are carried 
out. Ten failures are observed. Make a one- 
sided 95 per cent confidence statement about the 
reliability (probability of success) in a one-hour 
mission. 


Solution: In this problem $n = 10{,}000$, $r = 10$. Substituting in (25), we get $b = 0.9983$. We can be 95 per cent confident that the reliability in a one-hour mission is $> 0.9983$.

Remark: In carrying out the computations we use the fact that, since $n$ is large, $F_{0.05}(22, 19980) \approx F_{0.05}(22, \infty) = 1.54$.
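The Remark rests on the standard limit that, as the denominator degrees of freedom grow without bound, $F(\nu_1, \nu_2)$ approaches $\chi^2(\nu_1)/\nu_1$, so $F_\alpha(\nu_1, \infty) = \chi^2_\alpha(\nu_1)/\nu_1$. A short check of the two values used in Examples 9 and 10, assuming a hypothetical bisection helper for even-df chi-square points, is:

```python
import math


def chi2_upper(alpha, df):
    """Upper-alpha percentage point of chi-square with even df, by bisection."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch handles only positive even df")
    k = df // 2

    def tail(x):
        # P(chi2(2k) > x) = P(Poisson(x / 2) <= k - 1)
        term = math.exp(-x / 2)
        total = term
        for j in range(1, k):
            term *= (x / 2) / j
            total += term
        return total

    lo, hi = 0.0, 1.0
    while tail(hi) > alpha:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def f_upper_infinite_denominator(alpha, df1):
    """F_alpha(df1, infinity) = chi2_alpha(df1) / df1 (limit as df2 grows)."""
    return chi2_upper(alpha, df1) / df1


print(round(f_upper_infinite_denominator(0.05, 22), 2))  # Example 9: 1.54
print(round(f_upper_infinite_denominator(0.05, 2), 2))   # Example 10: 3.0
```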


Example 10 


Suppose that 10,000 one-hour missions are 
carried out and that no failures are observed. 
Find a one-sided 95 per cent confidence interval 
for the probability of mission survival. 


Solution: Substitute in (25) with $n = 10{,}000$, $r = 0$, and $F_{0.05}(2, 20000) \approx F_{0.05}(2, \infty) = 3.00$. Hence $b = 0.9997$. We can have 95 per cent confidence in the assertion that the true probability of mission survival is $> 0.9997$.




A CUSTOMER LOOKS AT THE RELIABILITY PROGRAM 
ACTIVITIES* 


H. R. POWELL†


INTRODUCTION 


The Space Technology Laboratories of Los 
Angeles has the responsibility for systems engi- 
neering and technical direction of all contractors 
on the Air Force Ballistic Missile Programs. As 
such, we work in close association with the mem- 
bers of the Ballistic Missile Division of the United 
States Air Force, in establishing, implementing, 
and directing the various activities of the contrac- 
tors who are working on these programs. How- 
ever, I have not always represented the military- 
customer side of the picture. I was for many 
years a contractor before becoming associated 
with the present programs. Therefore, my com- 
ments will stem from the combined background of 
both a contractor and a military customer. I will 
attempt, therefore, to put the requirements of the 
vast and complex military and Department of De- 
fense organizations into the language of a con- 
tractor. The remarks contained in the text of this 
paper will be a combination of what I know about 
the requirements and policies of the Department 
of Defense, the United States Air Force, and the 
policies of my own company, STL. 

To say that the Department of Defense and the 
Military Services have a high interest in relia- 
bility would indeed be an understatement. It is 
generally recognized by these agencies that the 
reliability of a system has tremendous implica- 
tions in terms of overall weapon-system effec- 
tiveness, in logistics and manpower, and in terms 
of forced sizing of the various component arms of 
our total military organizations. To say that 
these people are concerned by the overall reli- 
ability record established in the past is also put- 
ting it mildly. I would like to quote from the 
report of the Department of Defense Ad Hoc 
Committee for Guided Missile Reliability, dated 
April, 1958: ‘‘A significant number of guided 
missile weapon systems have failed to achieve the 
required reliability in the time scheduled and at 


*This paper was presented at the First Annual Bay Area Reliability Seminar, Menlo Park, Calif., February 19, 1959.

†Space Technology Labs., Ramo-Wooldridge Corp., Los Angeles, Calif.


the estimated cost. Consequently, military plan- 
ning is upset, funds are expended that might have 
been usefully applied to more productive projects, 
and the operational use of missile systems is 
postponed, thereby diminishing our nation’s mili-
tary strength.’’¹

The realization of the need for reliability sud- 
denly burst upon us when this country embarked 
on large-scale missile programs to be used in 
our offensive and defensive arsenals, and the em- 
phasis on reliability has increased over the past 
ten years. It is felt, however, that reliability tech- 
nology and achievements have not generally kept 
pace with the requirements. Therefore, there has 
been an evolutionary trend toward the concept of 
spelling out reliability requirements in more spe- 
cific terms in contracts. Contractors can expect 
to see increasingly stringent reliability require-
ments spelled out in contracts in numerical form.
This concept is now part and parcel of the BMD/ 
STL Reliability policy for the ballistic missile 
contractors. Similar requirements are beginning 
to appear in the contracts of the other Services. 

This procedure has several advantages: It 
tells the management of the company just what 
they are expected to deliver in the way of systems 
reliability; it gives the contractors’ own reliability 
organizations something definite to hang their hat 
on and to use as a basis for carrying out their ac- 
tivities; and it gives the military, the customer, 
better assurance that the systems he buys will be 
usable. 


A SYSTEMS-RELIABILITY PROGRAM 


In order to spell out more clearly what a con- 
tractor is expected to do about reliability, it has 
become customary to identify the kinds of activi- 
ties which are considered valid expenditures of 
manpower and money, a measure essential to the 
ultimate achievement of the required reliability. 
The Ballistic Missile Division and the Space 
Technology Laboratories consider the following 


¹Dept. of Defense Ad Hoc Comm. for Guided Missile Reliability report, publ. by Office of William Holaday, Director of Guided Missiles, Dept. of Defense, Washington, D. C.




contractor activities, each described briefly, to be essential to the carrying out of a balanced and complete reliability program. (Of course, the list could be lengthened or shortened through a subdivision of some of these program elements, or a combination of some, but these would be essentially variations of this same basic reliability program.)


1) Environmental Conditions Determination


Missile environment measurements should be reviewed, and interactive effects of the contractor’s hardware on missile environment should be estimated. Environmental conditions for assembly and parts specifications should be determined. Information on environmental conditions should be disseminated to designers, specification writers, and test planners.


2) Reliability Apportionment


Detailed reliability objectives should be established by numerically apportioning the contractor’s system reliability objective among the component assemblies and parts. The objectives should be disseminated for use in design guidance, in evaluations and comparisons, in planning of test programs, and in pinpointing problem areas.


3) Reliability Indoctrination


The reliability program requirements and procedures should be explained and ‘‘sold’’ to company personnel and to vendors. Measures such as a training program for engineers should be instituted.


4) Parts Approval Verification


The component parts engineering effort (whether centralized or not) should be monitored. Qualification tests and performance data on parts should be reviewed, coordinated, and disseminated. Information sources such as preferred parts lists, data files, and vendor ratings should be maintained.


5) Specifications Review


Specification writers should be assisted with regard to reliability objectives and statistical tolerance considerations. Specifications should be reviewed to assure that reliability will not be unduly compromised in such matters as reliability-performance trade-offs.


6) Design Review


Designers should be assisted in such matters
as environment problems, tolerances, component


application, developmental marginal checking, 
consideration of effects of production tooling 
methods on reliability, and human use factors. 
Designs should be reviewed for such reliability 
factors as adequate safety margins, provision for 
preventive maintenance, and appropriate redun- 
dancy. 


7) Failure Reporting Surveillance 


Procedures should be established and main- 
tained for reporting individual failures in plant 
and field operations involving prototype and pro- 
duction hardware. Individual failure reports 
should receive engineering analysis and be dis- 
tributed to design or production activities for 
prompt correction of troubles. Follow-up should 
be instituted to assure that failures are corrected. 
A card file should be maintained on failure-report 
data. Summary reports should be prepared from 
the card file and distributed, so that problem 
areas and progress may be defined. 


8) Statistical Test Planning 


Test engineers should be assisted so that opti- 
mum consideration is given to environmental test 
conditions, reliability objectives, and statistical 
design of experiments (particularly with regard 
to sample size, stress level, and arrangement of 
tests). Test plans should be reviewed to assure 
that testing will be sufficiently comprehensive to 
allow detection of important modes of failure and 
to provide a basis for effective evaluation. ‘‘Op- 
erating characteristic’’ curves should be pre- 
pared to define the effectiveness of the planned 
tests. 


9) Statistical Test Evaluation 


All test reports should be analyzed and evalu- 
ated for information on hardware capabilities and 
weaknesses. Data and results should be explained 
to designers, and also stored and cross-filed for 
reference so that accumulated information on 
particular designs may be readily located at any 
time. Information should be included on succes- 
ses as well as failures. 


10) Quality Control Coordination 


The quality control effort should be coordinated 
with regard to such matters as reliability objec- 
tives, production process control, and inspection 
test procedures designed to show up incipient 
failures. 


11) Program Data Evaluation 


All information and data obtained in the pro- 
gram should be organized, analyzed, and evaluated 




with regard to reliability. Estimates of achieved 
reliability should be made and projected to oper- 
ational use. Reports should be issued to define 
problem areas and report progress. A list of 
critical assemblies and parts should be main-
tained. A special effort should be made to ‘‘close
the loop’’ by feeding information back into the
organization promptly and at points where it is 
needed. 


12) Vendor Control 


The contractor shall take steps to ascertain 
through proper tests and surveillance that parts 
and devices supplied by vendors and subcontrac- 
tors are adequate for their intended application 
in the contractors’ equipment. These measures 
shall include tests to demonstrate design capa- 
bility and to provide a continuous monitoring of 
the vendor’s quality control and product improve- 
ment programs. 


13) Flight Test Planning 


Plans and specifications for missile flight 
testing should be reviewed from the point of view 
of obtaining as much information pertinent to 
reliability as possible without causing undue 
compromise or interference with other flight test 
objectives during their R & D program. Instru- 
mentation should be carefully reviewed as to 
adequacy of design and ability to yield data of re- 
quired quality. Special attention should be given 
to the question of whether or not certain tele- 
metering channels should be commutated when the 
reliability of a missile subsystem is being evalu- 
ated by the operating time-to-failure criteria. 
For flights during the latter stages of the R&D 
program and post-IOC flights, inclusion of reli- 
ability objectives and flight test planning is even 
more important, in the sense that flight-test data 
on missiles of operational or near-operational 
design are especially significant in making esti- 
mates of future reliability of operational systems. 


14) Analysis of Test and Flight Failures 


All failures resulting from environmental, 
factory, field, and flight test should be analyzed 
by people in the reliability organization to deter- 
mine the significance of the failures in a reli- 
ability sense. The analysis should be on both a 
statistical and engineering basis. An attempt 
should be made to determine causes of the fail- 
ures and, where feasible, whether or not the de- 
ficiency was due to design, quality control, or 
human factor. Where appropriate, these analyses 
should be fed back through the failure-reporting 
system described in section 7, above. 




15) Determination of Corrective Action 


Analysis of test results and failures should 
have two primary purposes: To determine the 
need for corrective action, and to establish at 
least a recommendation as to the nature of the 
corrective action required. 


16) Corrective Action Follow-up 


This constitutes one of the most critical of all 
reliability activities. Procedures should be set 
up which will allow for a routine follow-up of cor-
rective actions recommended or being acted upon,
and should contain a system of checks and bal- 
ances to assure that the required corrective ac- 
tions have not been forgotten or ignored by those 
responsible for implementing the final action. 
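Item 2) above asks for a numerical apportionment of the system objective but leaves the rule open. One simple rule, assumed here and not stated in the paper, treats the contractor's system as independent assemblies in series, so system reliability is the product of assembly reliabilities; assigning assembly $i$ the objective $R_i = R_{sys}^{w_i}$ with weights $w_i$ summing to 1 keeps that product equal to $R_{sys}$ (equal weights give $R_{sys}^{1/k}$). The assembly names and weights below are hypothetical.

```python
import math


def apportion(system_objective, weights):
    """Split a series-system reliability objective among assemblies.

    Assumes independent assemblies in series, so R_sys = prod(R_i).
    weights: dict of assembly name -> positive weight; normalized to sum to 1.
    """
    if not 0 < system_objective <= 1:
        raise ValueError("system objective must lie in (0, 1]")
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive")
    total = sum(weights.values())
    # R_i = R_sys ** (w_i / total): a heavier weight means a lower allotted
    # reliability, i.e. a larger share of the permitted unreliability.
    return {name: system_objective ** (w / total) for name, w in weights.items()}


# Hypothetical assemblies; the more complex ones receive larger weights.
objectives = apportion(0.90, {"guidance": 3, "propulsion": 2, "airframe": 1})
for name, r_i in objectives.items():
    print(f"{name}: {r_i:.4f}")
print(math.prod(objectives.values()))  # recovers 0.90, up to rounding
```

How the weights are chosen (complexity, state of the art, criticality) is exactly the engineering judgment the apportionment activity is meant to document; the sketch only guarantees that the allotted objectives multiply back to the system figure.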


It is recognized that the interpretation and implementation of these requirements may vary somewhat between contractors, depending upon the status of existing contracts, level and severity of the reliability requirement on that contractor, state of the art of the technical field in which he is working, and the contractor’s own organization, policies, and procedures. However, he is expected to show that these activities are being carried out in one form or another.

The question of contractor organization is con- 
sidered a very important point. Several things 
are expected of the contractor in this respect. 
First of all, he must have a separate and distinct 
reliability organization adequately staffed with 
technically competent people. Secondly, the or- 
ganization should be vested with certain in-line 
responsibilities for carrying out a meaningful 
program. Finally, it should report sufficiently 
high up in the company’s organization to give it 
authority and management backing. 


GOVERNMENT REPORTS ON RELIABILITY 


Two of the most significant of the recent documents to come out of the Department of Defense on the subject of reliability have been the report of the Advisory Group on Reliability of Electronic Equipment, called AGREE, and the report of the Ad Hoc Committee on Guided Missile Reliability (ACGMR). It should be noted that many of the activities described above coincide with activities recommended by AGREE and the ACGMR. The AGREE and ACGMR reports were intended to complement each other and not to duplicate each other’s work. The AGREE report is a collection of detailed technical procedures for dealing with many of the specific problems which arise in carrying out an active reliability program. The ACGMR report,




on the other hand, is considered to be more of a management-type document to be used by both contractor management and military management in guiding the overall efforts in the reliability programs under their cognizance. Thus, the AGREE and ACGMR reports form an excellent team: one supplies the detailed technical procedures to be used in reliability; the other, the management procedures to be used. In general, it has been found that those contractors who conscientiously carry out programs of the kind described above show very promising results.

Well known, by now, is the investigation which was carried out during the latter half of 1958 by a scientific study group under the auspices of the House Appropriations Subcommittee (Mahon Committee). This investigation apparently stemmed from a feeling of apprehension on the part of our national leaders as to whether or not the vast sums of money which are being spent on our missile and other weapons systems programs are producing weapon systems which are reliable. The results of the Study Group’s activities have not been made public at this time. However, we can anticipate that the results of this investigation and study will be seen in increased pressure on everyone, from the Department of Defense down through the Military Services and eventually to the contractors, for the attainment of a higher reliability than has been demonstrated in the past.


CONCLUSION

1) The push is on for higher reliability and for increased efforts on the part of contractors.
2) The military customers do not want lip service alone.
3) Contractors are expected to have a distinct and recognizable reliability organization.
4) To be effective, the organization must have in-line responsibility and authority.
5) There must be an actual reliability program in being, with definite reliability activities being carried out, not just a paper program which rationalizes the problem.
6) There must be management support for reliability.
7) The military customers encourage original work on the part of the contractors to develop their own techniques for analyzing and understanding the reliability problem and for generating solutions to it.
8) The military customers encourage more active interchange of information between contractors, not just through symposia and seminars, although these are very valuable, but also through more direct contact and discussion with each other.


112 IRE TRANSACTIONS ON RELIABILITY AND QUALITY CONTROL APRIL


CORRECTION TO “MODULE PREDICTION”
George Hauser, author of “Module Prediction,” which appeared on pages 53-63 of the June, 1959 issue of these TRANSACTIONS, has requested that the following corrections be made to his paper.
Eq. (1), page 56, should read
[Corrected Eq. (1) is illegible in the source scan.]

Eq. (2), page 56, should read

[Corrected Eq. (2) is illegible in the source scan.]


AVAILABLE NOW

Your Copy of
Production and Field Reliability
Edited by the
Technical Publications Committee of the Electronics Division of the
American Society for Quality Control

The manual begins with basic reliability principles and theory and progresses logically through the reliability and quality controls applied during the sequential steps of production and field use.

TABLE OF CONTENTS

1. Reliability Concepts
2. Terms and Definitions
3. Reliability Mathematics
4. Supplier Controls
5. Supplier Rating
6. Receiving Inspection and Test
7. Process Controls
8. Unit and System Tests
9. Maintenance Aspects
10. Field Failure Reporting
11. Field Failure Analysis
12. Field Evaluation


A single copy of the Production and Field Reliability Manual costs only $3.50. Address your order now to:

J. Bemersderfer 

ASQC Electronics Division Treasurer 

General Electric Co. 

Building 500 

Cincinnati 15, Ohio 


Be sure to enclose your name and address and $3.50 for each copy. Make all checks payable to the ASQC, Electronics Division.


